IV. Infrastructure



While pencil and paper are the only materials required for much theoretical research in economics, computational methods require the right hardware and software. Moreover, the hardware now includes not only the machines but also the high-speed networks that link them. And, in broader terms, the infrastructure even includes programming assistants, graduate student training and digital journals.

Some of the recommendations in this section of the report apply to the National Science Foundation and some apply to the economics profession as a whole.

The following sections discuss some of the key components of the computing infrastructure in economics.

A. Software

One area of significant software under-investment is that of interfaces between various statistical, mathematical, and graphical tool kits used by economists. The state of computing software in economics resembles the characters of the Ray Bradbury novel, Fahrenheit 451, where each researcher is expert in one of many distinct computing environments, such as Gauss, GAMS, Mathematica, RATS, S-plus, SAS, SST, Troll, or TSP. On a given project, many researchers would like to transfer data between windows in separate environments to take advantage of specialized features in different packages. An all-purpose software research environment is not a reasonable concept, but it would be useful to encourage development of software interfaces between existing environments. One mode of design that might help is object-oriented analysis software, such as that being developed by Oldford (1988), where inherited characteristics, such as the method of data construction (modeling), are carried along with the data. Object-oriented coding could help alleviate a key problem with imported software, namely, that the output produced by the imported algorithm has to be re-connected to the remainder of the user's software analysis tool kit.

One byproduct of the lack of adequate interfaces among alternative software environments is the notable inefficiency of duplicative coding of research innovations in mathematical and statistical analysis (a recent case in point is the coast-to-coast recoding of the Johansen algorithm for co-integration analysis). Although one can rely on vendors to eventually incorporate new wrinkles in computational analysis, the time lag is considerable. Whether or not this is viewed as a market failure may depend on the marginal cost of graduate students' time, but we suspect the translation lag slows disinterested scrutiny of algorithms that are the basis of published work. Although national software banks, analogous to national data banks, are a possibility, perhaps growth of traffic on national networks by economists will provide a solution.

One software problem that will not be alleviated by more network software swapping is the need for economists to develop standard modeling protocols and nomenclature. User agreements for modeling standards could encourage development of software modules in high-level languages that can be used by the variety of computing systems adopted by economics departments and agencies.

Finally, many of us feel that software availability is an important bottleneck and that NSF should encourage investigators to add software items to their grant request rather than to discourage them as is the current policy.

B. Visualization

Visualization has been used principally by non-social scientists with applications such as the visualization of molecular phenomena, smog in Los Angeles and the effects of alternative policy interventions, and visualization of patterns of fires, with simulations of what the landscape would look like at various points in history in the absence of fires. This novel technology would bring another dimension to the understanding of complex phenomena in economics in a dynamic setting. Moreover, it would enable the researcher to recognize patterns and trends not apparent from conventional data analysis, and to actually see the evolution of economic systems. Further, it would be a graphic way of representing questions raised by the discipline and of providing answers to researchers, laymen and policymakers alike. Inevitably, it will also suggest new theories.

This tool could also be used to visually measure the performance of algorithms and to explain their behavior to individuals not necessarily schooled in mathematical programming fundamentals. Economics, with its fascinating problems, should take the initiative in this direction.

C. Unix-based Networks and Workstations

The organization of computing in most economics departments is driven by the selection of either mainframes or PCs, or some mixture of both. On the one hand, mainframes are generally well-suited for SAS manipulations of large-scale data sets. Similarly, PCs appear to be adequate for small-scale macro modeling, have better graphical capabilities for pre-test evaluations and examination of estimation surfaces, and avoid the hassles of petitioning for larger departmental allocations on university mainframes. In addition, business schools may prefer to prepare students for the spreadsheet computing that graduates will face in the business world.

On the other hand, in adopting this either/or setup, students and researchers in economics are generally shut out from a wide midrange of hardware power and scientific computing software options that are available to users of Unix-based workstations and servers. Perhaps most damaging, the lack of access to the Unix environment deprives economists of low-cost imports of computing technologies developed in other scientific computing disciplines, ranging from freeware editors for multi-tasking to experimental software for large-scale optimizations and simulations.

Given the reduced prices of workstations and servers and the ability to establish networks of relatively inexpensive terminals (with varying degrees of local power) that can access more expensive compute servers, it is hard to believe that this state of affairs is only a consequence of sunk costs. Many colleges and universities encourage or subsidize student-owned personal computers, which could be attached to departmental Unix-based servers. However, the need for skilled maintenance and pooled resources in a Unix network does require overcoming bureaucratic inertia and incentives for noncooperative behavior.

The previous sentence may seem benign and the reader may rush across it easily. However, we urge a full stop here. In our experience, workstations are wonderful to use but difficult to set up and to maintain. Single users should not look at their PCs and think that a workstation will be only slightly more difficult to manage. Rather they should think of their workstation as belonging to a cluster around a server, and there should be a skilled and permanent staff member to support the entire set of machines.

D. Distributed Processing and Parallel Computing

One of the first things late-night users of a network observe is the underutilization of processor capacity on other network machines. Our impression is that network-distributed processing, a special case of parallel processing involving ten or twenty network processors, is not employed by economics departments and should be encouraged.

At a minimum, a mechanical adaptation of distributed processing is cost-effective and would permit even small departments to approach cranking speeds usually available only to privileged users of array or vector processors. Thus far, experience at the Federal Reserve Board of Governors is mostly limited to spawning independent processes, such as Monte Carlo runs to generate confidence intervals for staff model forecast production runs. In the case of independent processes, the speedup (the ratio of serial processing time to parallel time) is approximately proportional to the number of parallel processors, so it is relatively straightforward to obtain an order of magnitude speedup.
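The independent-process case described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the Board's procedure: the toy "model" (a constant forecast of 2.0 plus a Gaussian shock), the seeds, and the function names are all hypothetical. Each worker runs its own share of the replications with its own random-number stream, and the draws are pooled to form an interval.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def replication_batch(job):
    """Run one worker's share of independent Monte Carlo replications."""
    seed, n_reps = job
    rng = random.Random(seed)  # a separate random stream for each worker
    # Hypothetical toy model: forecast = 2.0 plus a standard normal shock.
    return [2.0 + rng.gauss(0.0, 1.0) for _ in range(n_reps)]


def forecast_interval(n_reps=20000, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Split the replications across workers, pool the draws, return (mean, 5%, 95%).

    ThreadPoolExecutor can be passed for debugging, but it will not speed up
    CPU-bound Python code; separate processes are needed for that.
    """
    share = n_reps // n_workers  # any remainder is simply dropped in this sketch
    jobs = [(1000 + w, share) for w in range(n_workers)]
    with executor_cls(max_workers=n_workers) as pool:
        batches = list(pool.map(replication_batch, jobs))
    draws = sorted(d for batch in batches for d in batch)
    lower = draws[int(0.05 * len(draws))]
    upper = draws[int(0.95 * len(draws)) - 1]
    return statistics.fmean(draws), lower, upper
```

Because the batches never communicate, the wall-clock time is roughly the serial time divided by the number of workers, plus a fixed cost for starting the processes. On systems that start worker processes afresh, the call to `forecast_interval()` should sit under an `if __name__ == "__main__":` guard.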

In the case of non-independent processes, the concurrent message passing associated with distributed processing is slowly altering appreciation of modeling systems, in contrast to a single fixed model specification. For example, to examine the robustness of policy options to alternative model specifications, different sectors from domestic and international macroeconometric models can be combined by message-passing simulations among various models. The causal structure, or the identification of "exogenous" and "endogenous" variables in a particular sector, is determined by the lists of variables imported or exported by particular sectors of the component models. In this manner, a variety of models can be generated by a small set of competing specifications. Also, alternative sectors, as might be constructed by a specialist, can be embedded in a larger simulation environment. Other obvious extensions include numerical analysis of differential games by agents with limited perceptions.
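To make the import/export idea concrete, here is a minimal sketch in which two sectors exchange messages until their values stop changing. Everything in it is an assumption made for illustration: the two sector equations, their coefficients, and the variable names are invented, and the loop runs in a single process, whereas in a distributed setting each sector would run on its own machine. A variable exported by some sector is treated as endogenous; a variable that is imported but exported by no sector is exogenous.

```python
def domestic(inputs):
    # Hypothetical domestic sector: imports foreign_output and policy_rate.
    return {"domestic_output": 1.0 + 0.3 * inputs["foreign_output"]
            - 0.5 * inputs["policy_rate"]}


def foreign(inputs):
    # Hypothetical foreign sector: imports domestic_output.
    return {"foreign_output": 0.8 + 0.2 * inputs["domestic_output"]}


SECTORS = {
    "domestic": {"imports": ["foreign_output", "policy_rate"],
                 "exports": ["domestic_output"], "solve": domestic},
    "foreign": {"imports": ["domestic_output"],
                "exports": ["foreign_output"], "solve": foreign},
}


def classify(sectors):
    """Derive the causal structure from the import and export lists alone."""
    exported = {v for s in sectors.values() for v in s["exports"]}
    imported = {v for s in sectors.values() for v in s["imports"]}
    return exported, imported - exported  # (endogenous, exogenous)


def simulate(sectors, exogenous_values, tol=1e-10, max_rounds=200):
    """Pass messages among sectors until the shared values stop changing."""
    endogenous, exogenous = classify(sectors)
    missing = exogenous - set(exogenous_values)
    if missing:
        raise ValueError(f"no value supplied for exogenous variables: {missing}")
    board = dict(exogenous_values)
    board.update({v: 0.0 for v in endogenous})  # arbitrary starting values
    for _ in range(max_rounds):
        change = 0.0
        for sector in sectors.values():
            message = sector["solve"]({v: board[v] for v in sector["imports"]})
            for name, value in message.items():
                change = max(change, abs(value - board[name]))
                board[name] = value
        if change < tol:
            return board
    raise RuntimeError("message passing did not converge")
```

Swapping in a competing specification of one sector only requires replacing that entry in `SECTORS`; the classification of variables is recomputed from the new lists.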

In theory, it is relatively straightforward to extend parallel computing to numerical analysis problems such as gradient methods of nonlinear optimization. However, in contrast to a range of accessible libraries, such as NAG, that contain pre-coded numerical modules for serial processing, we are not aware of a software library for network-distributed processing implementations of standard numerical algorithms.

To suggest an example of a more conceptual transfer of the structure of parallel computing to economic theory, an attractive application would be more explicit analysis of how heterogeneous agents learn or how markets generate transaction prices. In contrast to the concept of a representative agent with a single mode of learning, it would appear to be more realistic to specify that agents operate in parallel, with a variety of learning heuristics, and with restricted inter-agent messaging but periodic communication with a central message center that provides averaged market characteristics.
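A toy version of this setup might look as follows. It is a sketch under invented assumptions, not a model from the literature: the two learning heuristics, the gains of 0.2 and 0.5, the noise level, and the function name are all hypothetical. Agents never talk to each other; they see only their own information and the average that the central message center broadcasts every few periods.

```python
import random


def simulate_market(n_agents=50, periods=200, broadcast_every=5,
                    true_value=10.0, seed=1):
    """Agents update beliefs side by side; a center periodically broadcasts the mean."""
    rng = random.Random(seed)
    beliefs = [rng.uniform(0.0, 5.0) for _ in range(n_agents)]
    # Even-numbered agents learn adaptively from private noisy signals;
    # odd-numbered agents imitate the most recent broadcast average.
    rules = ["adaptive" if i % 2 == 0 else "imitator" for i in range(n_agents)]
    broadcast = sum(beliefs) / n_agents
    for t in range(1, periods + 1):
        # Each update uses only the agent's own belief and the last broadcast,
        # so the agents could be updated in any order or on separate processors.
        for i, rule in enumerate(rules):
            if rule == "adaptive":
                signal = true_value + rng.gauss(0.0, 1.0)
                beliefs[i] += 0.2 * (signal - beliefs[i])
            else:
                beliefs[i] += 0.5 * (broadcast - beliefs[i])
        if t % broadcast_every == 0:
            broadcast = sum(beliefs) / n_agents  # periodic central message
    return beliefs, broadcast
```

In this sketch the imitators never observe a signal directly, yet their beliefs are pulled toward the true value through the broadcast average alone.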

Although parallel computing at this point does not yet play a major role in economics, it represents a technology whose potential usefulness and power cannot be ignored. Effective exploitation of parallel architectures is still very time-consuming, although coarse-grain architectures, such as the IBM 3090-600 series, are, in a sense, easier to use than massively parallel architectures such as the Thinking Machines CM-2 machine. Nevertheless, economists merit the support that permits such investments, since much can be gained by rethinking algorithmic procedures from the parallel computational standpoint. Of course, emphasis should be placed on parallel algorithms which are useful in a variety of problems in econometrics, economic theory and applied economics. Finally, there is now an excellent text by Bertsekas and Tsitsiklis (1989) which lays the foundations for such algorithm development, albeit in a general setting.

E. Networks and Supercomputers

John Rust has been involved in a 4-year study of the retirement behavior of older men. He is trying to see whether the mathematical principle of dynamic programming (DP) can provide a good representation of their behavior. Since the DP model is very complex, he used a supercomputer to solve the model and compare its predictions to observations on a very large data set on the actual retirement behavior of 8,131 men followed over 10 years. This is an interactive process which involves repeated transfers of large amounts of data (up to 10 megabytes per model) between a supercomputer and an IBM 386 workstation. To date he has transferred close to 400 megabytes of data over the Internet, and he expects to transfer roughly 20 times that amount before the project is done. Some of the data is "raw" input data prepared on his PC for transfer to the supercomputer for estimation, and part is output data generated on the supercomputer, which he transfers back to his workstation for graphical post-processing (even the Internet does not yet have sufficient bandwidth for real-time, interactive graphical analyses). Without access to the Internet, he would be forced to use the supercomputer via a dial-in line and a modem, at much slower rates. Transfer of a 10 megabyte data file by this method would take approximately 12 to 15 hours, effectively making it impossible to interactively evaluate his models. With the Internet, he can transfer these files in less than 15 minutes. Thus, without the Internet, it would be virtually impossible to fully analyze his supercomputer output, and he would have to severely cut back the scope of his project.
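The figures quoted above imply effective transfer rates that are easy to back out. The short calculation below uses only the numbers in the preceding paragraph; the one assumption is that a megabyte is taken as 10**6 bytes, and the variable names are ours.

```python
MB = 1_000_000  # assumption: 1 megabyte = 10**6 bytes


def implied_rate(size_bytes, seconds):
    """Effective throughput in bytes per second."""
    return size_bytes / seconds


file_size = 10 * MB

# Dial-in modem: 12 to 15 hours for a 10 megabyte file.
modem_fast = implied_rate(file_size, 12 * 3600)
modem_slow = implied_rate(file_size, 15 * 3600)

# Internet: less than 15 minutes, so this is a lower bound on the rate.
internet_min = implied_rate(file_size, 15 * 60)

# Expected remaining volume: roughly 20 times the 400 megabytes moved so far.
remaining_files = (20 * 400 * MB) // file_size
modem_days = remaining_files * 12 / 24       # at the faster modem figure
internet_hours = remaining_files * 15 / 60   # at the 15-minute upper bound

print(f"modem: {modem_slow:.0f} to {modem_fast:.0f} bytes/s")
print(f"Internet: at least {internet_min:.0f} bytes/s")
print(f"remaining transfers: {modem_days:.0f} days by modem, "
      f"at most {internet_hours:.0f} hours by Internet")
```

On these figures the network is at least 48 to 60 times faster than the dial-in line, and the remaining transfers alone would occupy a modem continuously for more than a year.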

We can offer an even more compelling example of how inadequate communications can seriously hinder use of supercomputer facilities. In 1989, IBM granted Rust 200 cpu hours on their 3090 supercomputer as part of their "Research Support Program." For security reasons, the IBM 3090 (located at IBM's Palo Alto Scientific Center) was not accessible via Internet, but only over a special call-back modem on a dial-up line. IBM went to the expense of connecting a dedicated phone line to Rust's office so that he could use the 3090 without tying up his office phone line. Despite the use of clever terminal emulation software designed by IBM, the transfer rate of a 1200 baud modem is too slow to move files larger than 200 kilobytes in a reasonable amount of time. IBM recognizes this problem and provides a standard $5,000 travel budget to all research support program participants so they can fly to Palo Alto to install and remove large data sets on site. However, from the perspective of NSF, it is probably cheaper in the long run to support fast network access over Internet than to pay for the costs of dedicated phone lines and the travel expenses to physically transfer data sets on disk or tape. In terms of research turn-around time, it is difficult to see how one could do large computational tasks in a reasonable amount of time without a medium such as Internet. Rust found that the slowness of modem access to the IBM 3090 severely limited the types of projects he could undertake.

1. Long distance research collaborations.

The Internet is perhaps most useful on a day-to-day basis to facilitate joint research with colleagues around the world. Using a telephone is prohibitively expensive and also inconvenient since it is often difficult to reach busy colleagues or those in different time zones. The key advantage of Internet is the ability to jointly work on papers or programs, sending electronic copies of the paper back and forth in the process of revisions. John Rust's and his colleagues' work on developing double auction software shows that a team of workers can closely collaborate on software development even though they are separated by thousands of miles. In the Double Auction project, Richard Palmer was located at Duke University, John Rust was at Wisconsin and John Miller was at the Santa Fe Institute. Despite the distance, they managed to develop over 15,000 lines of code, and a 100-page participant's manual entirely via Internet communications.

Another example is Rust's paper "Dynamic Structural Models: Problems and Prospects" written jointly with Ariel Pakes at Yale University. Rust and Pakes were under very tight deadlines in writing up the paper, and they cannot imagine how they would have done it by exchanging trial drafts via Federal Express. Certainly it would have been much more costly and would have severely cut back the number of revisions they would have attempted.

Internet is very useful for a number of other activities, including: 1) jointly-organized conferences, sending out program updates and confirmations via E-mail, 2) using E-mail to speed up sending in referee reports for journals and NSF, and 3) obtaining free software over electronic bulletin boards. In the latter case, one can obtain a vast array of free software that would cost thousands if one were to purchase similar proprietary programs through standard vendors.

Our general observation is that Internet has been a significant factor in promoting not only our own research, but that of many of our colleagues. Last year, the University of Wisconsin spent $12 million on new wiring and network equipment to take advantage of the Internet system. Clearly, the research community is coming to depend on Internet: it is no longer a luxury but a necessity for many researchers. Several years ago, John Geweke suggested that investment in the Internet might be a substitute for further investment in supercomputers, since it provided a way of tapping unused computer power on researchers' desktops throughout the country. John Rust thought this was somewhat preposterous at the time; however, given his subsequent experience with the amazing speeds of (certain segments of) the Internet and the recent development of distributed computing software, this idea no longer seems preposterous but eminently practical. Currently, one of the main obstacles to practical implementation of this idea is concern about security in the wake of recent break-ins by computer hackers, and destruction of data by computer viruses and worms.

2. Supercomputers

While distributed computing arrangements can offer great advantages, for the very largest problems they are inadequate. For example, Rust has a problem involving arrays of several to hundreds of millions of elements. Most workstations have only about 8 to 16 megabytes of memory and 200 to 300 megabytes of disk storage. Solving a large-scale dynamic programming problem in a distributed computing environment would be infeasible because small sections of the problem would have to be distributed to many machines, requiring considerable programming effort to figure out how to divide the arrays, distribute the data to the individual machines, and synchronize the computations so that each component computation could be reassembled to produce the overall solution. There simply is no software that allows one to do this automatically, so the time required to write the software would be considerable (and the result probably rife with errors), and the ultimate throughput of such an arrangement would still be significantly less than that of supercomputers such as the Cray-2 that regularly achieve rates of 400 megaflops or better.

Thus, we would argue for a balanced approach to economic computing. Both workstations and supercomputers will be needed, and a fast network infrastructure will be the glue that binds them together.

F. Infrastructure in Economics and in Statistics

The hardware-software situation in economics is curious. Statistics is firmly part of the "hard science" and engineering community and culture, using Unix-based workstations, TeX word-processing and the Unix-based statistical package S. Use of Internet is widespread. With a few exceptions, economists use PCs (rarely reliably tied into Internet), a wide array of word processors and several statistical packages of which Gauss may be the most commonly used. The computational infrastructure in the statistical community is much better suited to computational economics than is the computational infrastructure of the economics community itself. We could elaborate on this at length, but will refrain from doing so here. In part, this situation can be traced to the suspicion that computational work is not highly valued in the economics community, but this is changing very quickly. More important is the fact that statistics departments are often on the "hard science" side of campus, and administrators who are accustomed to providing over $100K for new assistant professors in the lab sciences will see their way to spending $10K per active researcher to install, maintain and upgrade distributed computing in statistics (the level identified in the "Eddy Report" in Statistical Science a few years ago). Most important is the fact that statisticians have been active in identifying their needs (see the Eddy Report) resulting in, among other things, their active participation in the ideally leveraged Scientific Computing Research Equipment in the Mathematical Sciences (SCREMS) program at NSF. The characteristic of workstations and associated software that is important for computational economics is the ability for very rapid communication between researchers, and the ability to run software essentially interchangeably. The maturing of computational economics will lead to a system with these same characteristics, and our opinion is this will entail the adoption of workstations by economists. NSF should encourage and facilitate this movement in the same way it has in the mathematical sciences.

Some economists have found the economics program at NSF very responsive to their requests for hardware and software support in grant applications, providing flat amounts each year for maintenance and upgrading of a base funded originally through combinations of SCREMS and university money. However, many economists who do very good work and are regularly funded by NSF-economics report routine cutting of hardware and software requests.

It is our opinion that, on balance, NSF has been a follower rather than a leader in computational economics. In some areas, being more of a leader means saving money. A clear example is the archiving of data, which ought to be extended to software. When one sends software to a typical economist, one mails a PC floppy with code and quite a few pages of documentation, asking the person on the other end to return the disk. A secretary often handles this. When one sends software to a statistician, who makes a request by e-mail, one transfers the files on Internet which takes about a minute. NSF could do a lot to encourage (to the point of requiring) funded research to use Internet-accessible machines. This would have little or no impact on NSF's budget, but would appropriately place great pressure on campuses to upgrade their wiring.

G. Databases

One welcome development in applied macroeconomics in the 1980s was the move to subject macro theories to microdata analysis. The Center for Research in Security Prices, the Longitudinal Research Database at Census, the Manufacturing Sector Master File at the NBER, and the Surveys of Consumer Finance by the Federal Reserve Board are only a few of the sources of high-quality, detailed data for researchers. Four generic issues worth mentioning as areas for future funding are: (1) network access to standard archival data banks; (2) reduction of barriers to public access of disaggregated data, such as aggregations that preserve information but protect the confidentiality of reporters; (3) user-friendly concordances among alternative data bases, including links to corresponding foreign industry data banks; and (4) establishment of time-dated or "as of" data bases, where successive measurements of events are preserved. On the latter point, access to "as of" perspectives that indicate the calendar timing of data revisions would help identify information actually available to agents in historical time.
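One way to picture an "as of" data base is as a table that keeps every release of every observation together with its release date, so that a query can reconstruct what was known on any given day. The sketch below is an illustration only: the table layout, the series name, and all values are made up, and it uses an in-memory SQLite database simply because it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vintages (series TEXT, period TEXT, released TEXT, value REAL)"
)
# Made-up releases: the 1990Q1 figure is revised twice after first publication.
conn.executemany(
    "INSERT INTO vintages VALUES (?, ?, ?, ?)",
    [
        ("series_a", "1990Q1", "1990-04-27", 2.1),
        ("series_a", "1990Q1", "1990-05-25", 1.3),
        ("series_a", "1990Q1", "1990-06-21", 1.9),
        ("series_a", "1990Q2", "1990-07-27", 1.2),
    ],
)


def as_of(series, date):
    """Return (period, value) pairs as they were known on the given ISO date."""
    query = """
        SELECT v.period, v.value
        FROM vintages AS v
        WHERE v.series = ?
          AND v.released = (
              SELECT MAX(released) FROM vintages
              WHERE series = v.series
                AND period = v.period
                AND released <= ?)
        ORDER BY v.period
    """
    return conn.execute(query, (series, date)).fetchall()


print(as_of("series_a", "1990-05-31"))  # only the second release of 1990Q1 is visible
print(as_of("series_a", "1990-12-31"))  # latest release of both quarters
```

Because no release is ever overwritten, a researcher can condition on exactly the information that agents could have had at a given calendar date, rather than on the final revised figures.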

In these preliminary comments on databases, we are less concerned with specific areas of research than we are with the components of a good program in computational economics. Such a program can only be developed through a unified or balanced approach emphasizing the three basic foundations of empirical research. These foundations of good empirical economics are: 1) a solid theoretical base and econometric methodology, 2) appropriate data, and 3) accessibility of the data to a wide variety of researchers. Although we will focus our remarks here on the data component, we emphasize that a unified approach to an initiative in computational economics is needed.

1. Theory and Econometric Methodology

The first need is for a theoretical framework rich enough to allow for estimation of basic structural relationships among economic agents. Here we are thinking about establishments, firms, individuals and households as the economic agents of interest. The models developed must be rich enough to provide for stochastic outcomes, incorporate equilibrium concepts that enable researchers to evaluate the impacts of spillovers on behavior, and provide for efficient use of both cross-section and time-series variations in agents' behavior. For example, there is a growing literature on stochastic models of industry evolution such as the one recently developed by Pakes and Ericson (1988). To estimate such models, longitudinal data on individual agents is necessary. Development of longitudinal micro databases is possible and, to a limited extent, such databases are already available. (See Olley and Pakes (1991) for a successful implementation of a dynamic stochastic model with endogenous exit behavior using the Longitudinal Research Database (LRD) at the Census Bureau.)

2. Economic Data

It is not enough to develop models. (Parenthetically, we note that the existence of appropriate databases will tend to spur theory development. We think the expansion of industrial organization data in the 1960's directly fed many theoretical advances in the 1970's and 1980's.) In order to make progress on answering basic policy questions, these models must be parameterized and "experiments" with different policy instruments undertaken. This requires microdata panels. Such panels are possible to create, but they will require substantial matching across existing databases, extensive editing and long-term commitments of resources, including mechanisms for feedback between the entities collecting data and the information developed from the matched database(s). In addition, since the data involved will be subject to confidentiality restrictions, methods for researcher access will need to be a part of any plan.

A comprehensive initiative will seek to bring together establishment and firm information on outputs (products and services), inputs (capital, labor, R&D, materials, purchased services), financial and management characteristics, and demographic information on workers (work histories, education, training, etc.). We are stressing a rather broad view of a database initiative for a couple of reasons. First, we think such an effort is possible, although we recognize that political and other realities may suggest a less grandiose design. Second, we think that many problems will require elements of both demographic and establishment data for resolution. For example, the production function is basic to most economic theories. It requires information on input flows that are in turn derived from estimates of capital, both human and physical. If we want to get correct answers we must develop the ability to measure the basic variables in the underlying structural equations of our theories.

One might ask why the emphasis on micropanels in these comments. The answer is that it is becoming increasingly apparent that, for a wide range of problems, the assumptions that allow one to use aggregate data to draw inferences about the behavior of individual economic agents are not valid. Whether the application of the data is productivity measurement, business cycle analysis, R&D resource allocations, job turnover, minority and small business development, environmental or merger policy, the stylized facts suggest that it is essential to take account of the heterogeneity of economic agents. For example, economic studies at the Census Bureau are finding that the basic building blocks of the statistical system show substantial variation along a variety of dimensions and these variations are not constant over time. As outlined in McGuckin (1989) and numerous papers produced by the Center for Economic Studies, this has potentially important implications for economic measurement. In particular, the size and nature of observed heterogeneity call into question the use of, for example, simple fixed-effects models of economic performance utilizing aggregate industry data based on a "representative firm" as a basis for measurement and policy analysis.

Even at very detailed levels of aggregation, such as the 4- and 5-digit levels of industry and product classifications, there exists substantial heterogeneity among establishments and firms along a wide variety of dimensions. This heterogeneity in economic behavior is associated with observables such as age, size, location and ownership of the plant. Nonetheless, it is also true that even after controlling for these observables, there remains substantial residual heterogeneity in the behavior of establishments. A similar story can be told from the perspective of wage equations estimated with supply-side demographic data. Moreover, some forms of behavior that often are not treated as endogenous, such as ownership changes, entry and exit, migration of production and distribution systems, and regulatory changes -- all interesting phenomena in their own right -- have all been shown to affect the behavior and performance of firms and establishments. Thus, not only is there heterogeneity in the behavior of economic entities at a point in time, but the nature of that heterogeneity is also changing over time. For these reasons, we place an emphasis on micro-panel data.

3. Accessibility

It is important that a wide array of researchers have access to the data both for replication purposes and tests of alternative models. This is a major difficulty because of the inherent confidentiality of data on individual agents. Most official economic data publications are based on aggregations of microdata collected in statistical surveys of individual respondents. These data are used by policy makers, researchers and market analysts as economic indicators and as a source of information for developing economic policy and testing economic theory. As useful as these aggregate data are, the underlying microdata provide even more valuable information for the study of the economy. Many hypotheses concerning the nature of production, technical change, and the interaction of individual firms can only be tested using detailed microdata. For example, John Solow (1987) argues convincingly that, using aggregate data, it is impossible to determine whether energy and capital are complements or substitutes. Moreover, the extent of aggregation bias can only be evaluated with the use of microdata. As a result, the necessity for use of detailed microdata by the public and private research communities cannot be minimized.

Faced with this, statistical agencies such as the Census Bureau have sought ways to make microdata available to outside researchers and policy makers without violating confidentiality commitments to individual respondents. Aside from cost and legal issues, the confidentiality commitment to respondents is of great concern because statistical officials fear that low rates of response to statistical surveys will get lower if the released microdata reveal confidential information about individual respondents.

All masking techniques create surrogate data by adding either stochastic or systematic (or both) measurement errors to the data. In turn, undoing or correcting for such errors can only be accomplished within the context of specific econometric models. Put differently, evaluation of the effects of measurement error on parameter estimates depends on the model describing the relationships among the variables associated with the unmasked data. Thus, determining the usefulness of a public use data file is essentially a problem in evaluating the effects of measurement error.
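A standard textbook case shows why the correction is model-specific. Suppose, purely as an assumption for illustration, that the true relationship is a simple linear regression and that a regressor is masked by adding independent noise of known variance. Ordinary least squares on the masked data then understates the slope by the factor var(x) / (var(x) + var(noise)), and the bias can be undone only because both the model and the noise variance are known. The simulated data, parameter values, and function names below are all hypothetical.

```python
import random


def variance(values):
    """Sample variance with the usual n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n = 50_000
beta = 2.0       # hypothetical true slope
noise_sd = 1.0   # standard deviation of the masking noise, assumed to be published

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]
x_masked = [xi + rng.gauss(0.0, noise_sd) for xi in x]  # the public-use version

slope_true = ols_slope(x, y)            # what a researcher with confidential data sees
slope_masked = ols_slope(x_masked, y)   # attenuated toward zero, here to about beta / 2

# Correction valid only under the assumed model: linear relation, independent
# additive noise of known variance.
var_masked = variance(x_masked)
slope_corrected = slope_masked * var_masked / (var_masked - noise_sd ** 2)

print(f"confidential data: {slope_true:.2f}")
print(f"masked data:       {slope_masked:.2f}")
print(f"corrected:         {slope_corrected:.2f}")
```

A researcher fitting a different model to the same masked file, for example one that is nonlinear in the masked variable, would need a different correction, which is the sense in which a public-use file cannot be evaluated apart from its intended use.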

It would be convenient to have one public-use file that could provide researchers with sufficient information to test hypotheses and estimate models, while maintaining confidentiality protection for respondents. Unfortunately, the masking techniques used to preserve confidentiality limit the economic studies that can be carried out with any particular public use data set. (See McGuckin and Nguyen (1990).) Thus, it is extremely unlikely that any single public use file will satisfy all users, particularly for data on business populations that are highly skewed. Ideally, it would be best to release many different files to satisfy the needs of different researchers. But this complicates disclosure analysis because the release of a particular public use file often makes it possible to identify individual respondents in another file that by itself would not reveal confidential information.

These issues are of clear importance to economists. Yet, the confidentiality issues are not widely understood, and there has been little research on the subject even within the statistical community. One approach in use at the Center for Economic Studies of the Census Bureau is to have outside researchers visit as special sworn Census employees with access to the microdata under controlled conditions. In fact, this mission is a major component of the Center for Economic Studies. Costs of such access cover, among other things, security agreements and disclosure analysis. Unfortunately, for some projects these costs may be substantial when viewed from the perspective of the individual researchers without grant support. Procedures for data access must be considered as part of any comprehensive initiative in computational economics.

Currently the Center for Economic Studies is working on proposals to link workers to Longitudinal Research Database (LRD) plant level data, environmental emissions to LRD plant level data, non-manufacturing plants longitudinally to make the LRD truly economy-wide, as well as a number of other smaller efforts.

4. An Economics Server

An argument can be made that much of U.S. economic statistics is still organized in a form that was appropriate to the econometric models of twenty years ago, but is completely out of step with the possibilities now available for databases maintained on high-speed servers that are accessible over networks of workstations.

It would seem worthwhile to organize our data so as to permit the user to impose his or her own view on the data and then retrieve it with the type of aggregation needed for the work at hand. While organizational lines are very important to those who create and maintain economic data, they are frequently not important to users. Therefore, relational databases should be organized to let the user access the data in ways which are independent of the organizations which create and maintain the databases.

This could be accomplished by the creation of a National Economics Server (NES). This server would provide an interface between data users and suppliers. It would have a menu-driven interface that would make it simple for the user to download data without having to be concerned about the agency which maintains each portion of the desired data. The economic data as such would not be stored on the National Economics Server, but rather links would be maintained to the organizations which maintain major economic databases. A single account could be maintained on the server by the user, and the server organization could then pass along the appropriate fees to the organizations which maintain the databases.

In the future we will want to have databases of models as well as of numbers. The NES might also maintain a database of economic models and the documentation which supports them. Thus if a user were interested in models of the economic effects of global warming, he or she could access the models and their databases and build on the existing work.

Because science depends on reproducible results, it is extremely important that the manipulations undertaken to develop estimates are documented in a machine-independent form. Query languages offer a platform-independent way of describing data manipulation. They also use a standard vocabulary of verbs, instead of "cute" or "ingenious" code that is not easily interpreted by persons attempting replication.
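
A minimal sketch of this point, using SQL through Python's built-in sqlite3 module: the tables (plant, shipment), their columns and all values are hypothetical. The entire manipulation is the single declarative statement held in QUERY, written with the standard verbs SELECT, JOIN and GROUP BY, so a replicator can read it without tracing procedural code. Changing the GROUP BY columns changes the level of aggregation without touching the stored data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plant (
        plant_id INTEGER PRIMARY KEY,
        industry TEXT NOT NULL,
        region   TEXT NOT NULL
    );
    CREATE TABLE shipment (
        plant_id INTEGER NOT NULL REFERENCES plant(plant_id),
        year     INTEGER NOT NULL,
        value    REAL    NOT NULL
    );
    INSERT INTO plant VALUES
        (1, 'steel', 'Midwest'), (2, 'steel', 'South'), (3, 'paper', 'Midwest');
    INSERT INTO shipment VALUES
        (1, 1987, 10.0), (2, 1987, 5.0), (3, 1987, 2.0), (1, 1988, 12.0);
""")

# The documented manipulation: total shipments by industry and year.
QUERY = """
    SELECT p.industry, s.year, SUM(s.value) AS total_value
    FROM shipment AS s
    JOIN plant AS p ON p.plant_id = s.plant_id
    GROUP BY p.industry, s.year
    ORDER BY p.industry, s.year
"""
by_industry = conn.execute(QUERY).fetchall()

for row in by_industry:
    print(row)
```

Publishing the query text alongside the estimates would let another researcher rerun the same manipulation on any system that accepts the same SQL, subject to the vendor dialect differences noted elsewhere in this report.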

In addition, replication demands attention to the naming of the measurement objects. If different researchers use different names for the same measurement, scientific discourse is obscured. If different scientists use the same name for different measurements, anarchy results. Economists need a standard nomenclature for measures, just as much as chemists. Databases enforce permanent nomenclature on the source of measurements and standardize labelling used in computing derivatives from a particular data source.

The following sections are particularly concerned with information systems for complex data and with the supporting computational hardware and software.

5. Complex data

This kind of data has been growing in importance and potential over the last twenty years. Panel data (Panel Study of Income Dynamics - PSID), matched observations from clients and administrative records (Wisconsin Assets and Income Studies), designs with multiple units of analysis (High School and Beyond), and data from the Social Experiments (Seattle-Denver Income Maintenance) are all examples of complex data. Such data are characterized by: 1) group design, 2) parallel exploitation, 3) extended periods of data collection, and 4) alterations in design during that period.

Group design in complex data implies that no single scientist is fully informed about the design in the absence of good internal communication within the organization responsible for data collection. Parallel exploitation of the data implies that scientists other than the collectors have access to the data. Secondary users need design and process information that is used by the collectors to affect their design. Extended periods of data collection imply that information about a changing design must be archived for use by a younger cohort of scientists at a later date.

The task of scientific journalling of the activities that generate complex data and collating the knowledge that is generated from the data is beyond the capacity of traditional social science and its journals. Repeated failures demonstrate the inadequacy of current institutions for supporting secondary users and maintaining archival data (David (1980)).

6. Exploiting complex data

Fortunately, we can overcome errors of the past. Technology and knowledge have changed the cost-effective mode for exploiting complex data. At the same time they have created an environment in which researchers can be more strongly linked to data producers. The new technology creates opportunities for scientific collaboration that furthers the agendas of both scientists and policy makers.

The technology and knowledge that create a new mode for exploiting complex data contain three elements: 1) high-capacity communication networks, 2) relational database management systems, and 3) "necessary support" (David (1991)). When these elements are augmented by a specialist "expert" and adequate computational capacity in an institution that provides incentives for cooperation, the combination makes low-cost access to data feasible.

Data, relationships, metadata, and aggregates of data are a complex object that must be transmitted in its entirety if it is to be used scientifically (David (1991)). An objective for databases in use in economics is to implement standards for that complex object. Relational databases come closer than any other tool we now have to realizing the objective of working with a complex data object.
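
One way to picture the complex data object is as a single relational database that holds all four pieces together. The sketch below is an assumption-laden toy, not the SIPP ACCESS design: the tables (person, income, variable_doc), the view household_income and every value are hypothetical. It uses Python's built-in sqlite3 module to keep data, relationships, metadata and an aggregate in one transmittable unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Data and relationships: persons belong to households; income is per wave.
    CREATE TABLE person (
        person_id    INTEGER PRIMARY KEY,
        household_id INTEGER NOT NULL
    );
    CREATE TABLE income (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        wave      INTEGER NOT NULL,
        amount    REAL    NOT NULL,
        PRIMARY KEY (person_id, wave)
    );
    -- Metadata: the definition of each variable is stored next to the data.
    CREATE TABLE variable_doc (
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        definition  TEXT NOT NULL,
        PRIMARY KEY (table_name, column_name)
    );
    -- Aggregate: defined once in the database rather than in each user's code.
    CREATE VIEW household_income AS
        SELECT p.household_id, i.wave, SUM(i.amount) AS total
        FROM income AS i
        JOIN person AS p ON p.person_id = i.person_id
        GROUP BY p.household_id, i.wave;

    INSERT INTO person VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO income VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 1, 70.0);
    INSERT INTO variable_doc VALUES
        ('income', 'amount', 'Monthly money income, invented for this sketch');
""")

# A secondary user retrieves the aggregate and its documentation the same way.
totals = conn.execute(
    "SELECT household_id, wave, total FROM household_income "
    "ORDER BY household_id, wave"
).fetchall()

definition = conn.execute(
    "SELECT definition FROM variable_doc "
    "WHERE table_name = 'income' AND column_name = 'amount'"
).fetchone()[0]

print(totals)
print(definition)
```

Because the documentation and the aggregate definition travel inside the same database as the observations, copying the database copies the whole object, which is the property the paragraph above asks for.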

Special-purpose programming to optimize on computer cycles is not efficient in an economic sense if the integrity of the data object is lost. The inefficiency comes about because many potential research questions go unasked or unanswered, as users do not have an understanding of the special purpose programming system. The relational model provided by the computer science community gives a common basis that can be taught and learned widely, then applied to a broad range of problems. The special purpose system does not meet that criterion.

We anticipate that improvements upon the present relational model for data will be forthcoming, due to experiments now going on with object-oriented databases, temporal databases, and interfaces between hypertext and databases. Nonetheless, these improvements will necessarily retain much relational structure, and it behooves the profession to begin organizing relational databases, since future extensions are unlikely to be incompatible with the improvement in data access that is already technically possible.

7. An Information System for Complex Data

It is precisely this combination that was created for five years to foster research on the Survey of Income and Program Participation (SIPP) (David (1985) and David and Robbin (1989)). National Science Foundation support for this project recognized the pressing national need to develop solutions for storing, accessing and retrieving data from very large, dynamic statistical data sets with complex designs (Fienberg, Martin and Straf (1985) and Aborn (1988)). SIPP ACCESS designed an information system that integrated statistical data and metadata (information about the data), including the database design and contents, survey design, collection and processing procedures, and the results of data analysis. More than 2.2 gigabytes of data released as cross-sectional public use files were reduced by about 75 percent using relational database management system (RDBMS) software. New data structures were designed to facilitate an understanding of the complexity of longitudinal panel surveys, reduce conceptual errors, and obtain large reductions in the amount of time required to prepare data for longitudinal panel analysis. Technical memoranda reflecting the scientific design and processing at the Bureau of the Census were collated and catalogued to create the only complete record of the data collection activity.

The foregoing activities reduced the cost of research access to these complex data dramatically. Learning time, assembling necessary documentation, computing resources and the cost of mistakes in research procedures were reduced by two orders of magnitude.

At the same time, the facility created a node through which social scientists could maintain active communication with the data collectors and with each other. This resulted in:

The users of the facility could, and did, exchange contextual data, solutions to missing data problems and background information through the network.

The facility has served 45 research projects nationwide. More than 150 faculty, graduate students and programmers in universities across the country; policy analysts in private, not-for-profit research institutes; and members of federal agencies have been trained in SIPP ACCESS workshops. They have learned the SIPP design, how to apply relational database management systems (RDBMS) to complex data, and how to use the new laser disk storage and communication technologies.

8. Resource Requirements for a Database Program

To multiply the number of databases that are handled in information systems for complex data requires:

a. National access to high capacity communications

For major research institutions this already exists in the form of NSFnet. The need for high capacity is to permit downloading of data from host nodes distributed over the country to researchers who may be resident at small institutions. Centralized storage has the advantage of ease of updating the database and assuring concurrence of data in use by the research community. Any decentralized dissemination of databases entails high transaction costs to monitor versions of the database in use and distribute newer versions. This is an issue of consequence because of the continuing discovery of error in what might otherwise appear to be static databases.

b. Resources for the design of database "schema" and the overlying information system

Every scientific data collection has its design, and the appropriate database structure must conform to that design or needless human error and excessive costs of "learning" will characterize utilization of the database. The schema will need to include meta-data and aggregates.

c. Maintenance, including "system-level" adaptation to software and hardware evolution and commitment of a professional "expert"

Because hardware and much of the database software are not built by social scientists, they must react to the capabilities that are offered to a mass market that is not oriented to research, i.e. the answering of one-of-a-kind questions. At the same time, the prospect is that systems now being sold offer more capability for dealing with objects, definition of data-types, and checking of integrities which should serve to increase quality and control over data.

Given the cost of designing a DBMS and adapting it to alternative computer architectures, it would be foolhardy for social science to "reinvent the wheel" by creating its own system. Social science should lobby for adaptations of existing systems and create "applications programs" if professional agreement is reached on the capability required. The number of competing systems and the non-standard languages used, despite the SQL standard, imply that social scientists need to coordinate development of databases for related complex data sets. Social scientists also need to be aware of the interfaces that exist between database systems and statistical processors which have a comparative advantage at sweeping through rectangular arrays, but which do not generally support set-theoretic operations on relations and which do not offer the protections of concurrence, integrity and dynamic independence that are the forte of "truly relational" RDBMS (Date (1988)).

d. Teaching resources, to learn the "database command syntax" and the "language of the embedded data"

Learning how to use information systems for complex data will require training, just as the science of econometrics has required training. The database vendors have not yet settled on a standard language that is transferable across vendors, nor have social scientists agreed on a "canonical form" for information systems. For the near future, several variants on the theme will be used, just as with statistical processors.

e. Storage media for data, text files and bibliographic databases

This may appear obvious, but the implementation here is not. The ideal would be "virtual storage" that encompasses near-universal scope in terms of existing social science data. Real storage would need to be random access and off-line to be economic. (The SIPP-ACCESS system worked successfully with such a virtual system, but the technology used is already obsolete.)

f. Institutional commitments to assure egalitarian policies for access, security for confidential data, and the basis for continuing relationships supporting data producers

Because complex data will increasingly involve researcher access to data in which individuals or organizations are potentially identifiable, security rules, bonding and criminal liability may be associated with data access in the future. To be fair to individuals, agreements to use such restricted data must be co-signed by the administrators who in the last analysis control security and resources in their institutions.

One might ask if the resources required for the complex data program outlined above are large. Compared to solo investigator projects, the costs of retrofitting information systems onto data acquired in conventional ways are large. However, this comparison is fallacious. First, the costs of conventional solo investigator projects will fall if database technology is used. Economies of scale are substantial; the second and subsequent analyses have lower marginal costs. Also, learning about data will be more efficient, replication will be less costly, and transformation of data to particular conceptual models will be more efficient. Second, the quality of data and the quality of analysis will rise because error can be more systematically studied, more completely detected, and eliminated from the data source and the estimates. Lastly, processing of data that is required for the capture of administrative records and statistical protocols can be much more efficiently accomplished by database management. The real possibility exists that databases dominate earlier technologies for data production in statistical inquiry, and if adopted by the data collector, the investment required on the part of secondary analysts would be minimal.

9. What can NSF do?

NSF must make clear that it views each information system for complex data as a national laboratory facility. The facility would supply ongoing maintenance, assurance of a host for each complex data set, and a commitment through which necessary institutional arrangements can be secured. Because the information system is infrastructure for science it requires a different kind of peer review than the applications which can be made with the data.

NSF can facilitate relationships between software vendors and the information system. Our experience is that vendors do not understand the academic market and exercise monopolistic power in licensing software.

NSF can become a focus for pressure on the data producers to invest in information systems. If we are correct, and the technology is dominant over conventional means, this is a "win-win" situation.

NSF can sponsor planning meetings at which priorities for the scope of database laboratories can be established, and agreement can be reached on the standards that are appropriate to scientific design. Thorough understanding of the strengths and weaknesses of alternative software, and its likely evolution, can be reached. The latter activity will require close interface to the computer science community (which has not been educated about the complexity of typical social science retrieval activity).

10. How does this relate to the need for computation in the social sciences?

a. Fallacious conclusions

The cheapest available RDBMS systems run on microcomputers. Given the number of operations, memory, and storage available on those machines, they are extremely attractive for RDBMS work.

Experiments with SIPP data have established that Ethernet connections between MicroVAX and 80386SX machines make downloading of data from the database engine to microcomputers economic. Handling very large-scale survey data in this way is feasible. High data-transfer rates are required, and more memory in the microcomputer is desirable. (One of the limitations of microcomputer versions of RDBMS is that they do not all support use of large memory stores.)

Development of the information system for complex data will need to take place on a machine that is networked to the NSFnet, and which can support multiple users. With SIPP, our experience shows that even a relatively large scope, large-scale survey can be handled with either a micro-mini computer (microVAX) or a well-endowed workstation. The limitations on development come from the scarcity of human capital -- people who know systems programming, database theory, research in social science and information science. Limitations also come from the failure to endow database development with control over the computer environment used. (The need to adapt queues, system parameters, updates, and storage to competing demands from other irrelevant activities imposes high costs on the development of a database.)

b. Development needs

Two major development needs are clear:

This ends the discussion of complex data in particular as well as economic databases in general and opens the way for a discussion of the need for programming assistants.

H. Programming Assistants

Perhaps the most severe single constraint on computational economics is the forced reliance for programming support on graduate students who turn over every couple of years, and in any event are not the best suited to support computational infrastructure. In our opinion, NSF should provide support for full- or part-time computer assistants to help maintain networks of workstations and to provide programming support. The cost would be modest compared with the bottle-washers, laboratory technicians and post-doctoral fellows of laboratory science.

How can we reduce the costs of acquiring computer expertise? In our experience, NSF's policy of providing researchers with individual workstations has had a much bigger impact on the general level of computer expertise than providing extra supercomputer funding. It involves difficult issues of the proper balance of funding for "big science" vs. "little science." The supercomputer is frankly intimidating to many researchers: its sheer power and remoteness seem to discourage "playing around" that seems to be essential to really understanding how to use a computer effectively. On the other hand, a desktop workstation is a much more friendly device: it is always available whenever you turn it on, and there are no cpu-limits, account expiration deadlines, software revisions and "bug-fixes," and other bureaucratic hassles of working on the supercomputer. The "PC revolution" had such an amazing impact precisely because users had much more incentive to learn to understand a computer that was their own personal property than a computer they could only rent space on for an indefinite period of time. We find the administration of supercomputer accounts at the NSF supercomputer centers to involve a number of seemingly unnecessary bureaucratic obstacles. An example is NCSA's policy of setting relatively short 1-year account expiration deadlines. If one is doing anything that is slightly non-standard and encounters significant data or programming problems, it can be very hard to use up a cpu allocation in only one year. It would be much easier on the user to be allocated a block of cpu time with either no calendar expiration date, or a longer 2- to 4-year period before expiration. Office workstations do not impose such unnecessary deadlines, and as a result many researchers would rather use their workstations than a supercomputer even if their runs take 100 times longer.

Given the amazingly rapid improvement in chip technology, today's desktop workstations are essentially little supercomputers. Thus, it is getting to the point where even the personal workstation can be a fairly complex and intimidating object. To avoid wasting idle cpu cycles, we have seen a migration away from single-user, single-tasking operating systems like DOS to multi-user, multi-tasking operating systems like Unix and MACH. Owning a Unix workstation requires the owner to be a "super-user" in charge of system administration and this involves a fairly high degree of computer sophistication. For example, given the concern about network viruses and worms, it is important to make sure that various security procedures are adhered to, including setting up user logins and passwords and read/write permissions, monitoring of network access and disk IO, in addition to dealing with normal hardware and software maintenance. These tasks were formerly performed by a full-time professional; now they must be a sideline activity of any researcher who acquires their own workstation. This requires the user to acquire a fair amount of computer-specific human capital in order to effectively and safely use modern generation workstations.

The upshot is that while for sufficiently simple machines (like the early DOS PCs), the incentive effect of personal ownership was sufficient to overcome the relatively small human capital investment (resulting in a larger group of computer literate users), for the newer high-performance workstations the personal ownership incentive may not be enough: it may be optimal to revert back to the centralized bureaucratic mainframe cpu-time allocations to economize on duplication of computer-specific human capital.

There is an alternative that still allows a healthy degree of decentralization in computing resources: namely to have local "computer consultants" who can assist in the running and maintenance of many departmental workstations and help write software and answer questions of the non-specialist users. This type of arrangement is difficult to support under current funding arrangements: the money seems to be there to acquire the machinery, but not to pay for a professional to help manage, maintain and instruct new users about this machinery. It ends up diverting all the costs onto the professor, and eventually it becomes a deterrent to trying to acquire more computing resources.

While NSF does support student research assistants, for many detailed multi-year projects it can be inefficient to support a graduate student for a year, lose him or her to the job market and face incurring the search and retraining costs for a new graduate student. In our experience, the lack of students with sufficient expertise in computers, the high costs of training and the lack of continuity in one-year contracting arrangements ended up making it easier and safer for us to do our own programming. An intermediate situation that we would much prefer is to have a departmental "computer expert" who can help develop a "pipeline" of relatively trained student research assistants, while also providing the continuity to ensure that if one student familiar with a detailed computing project quits, another substitute could be quickly found. Our experience would suggest that departments will need about one computer expert per twenty workstations.

I. Training of Graduate Students

The implications of the emergence of computational economics for the way we train people are becoming clear. It's essential for first-year graduate students to be fluent in computer software, as they must be in mathematics. Our experience is that economics is some years behind statistics -- not because of inherent differences in the problems, but because of differences in the ways computing has been treated within the structures of universities and funding agencies, and in peer evaluation. The model emerging in the near future will be one in which a workstation or associated high-level terminal is available to each graduate student. These machines will be used for classwork and research. In addition, with appropriate software for distributed processing, these machines will be available for intense background computational tasks (something which is essentially impossible in the PC architecture). It's a small step to distributed processing over networks. Currently, economics is behind statistics and other computationally-oriented fields in this regard, and the distance grows greater all the time. Much of the investment will have to come from universities. The funding requirement is shrinking, however, not growing. Ideally, an important contribution of NSF would be the one it has historically made in the lab sciences, the provision of training grants for graduate students.

Training in computational economics has specific implications for the continued training of mature investigators as well. Experience in statistics again suggests a model. In the summer of 1989, Nancy Flournoy from the Statistics and Probability Program arranged for one week of the annual AMS research meetings (at Arcata, that year) to be devoted to statistical multiple integration. It brought together statisticians, computer scientists and numerical analysts (40 or 50 all together) for a week. A primary reason for the meeting was that while multiple integration is a research topic for all three groups, the focus is different. In statistics the concern is with algorithms that work well in the vast majority of a loosely structured set of cases, with diagnostics indicating when they do not. In contrast, numerical analysts seek procedures that work most reliably in worst-case scenarios for more narrowly defined problems. For several years John Geweke had worked in this area with the suspicion that he might be reinventing someone else's wheel. The meeting quickly brought the difference in orientation into focus, and showed that the attendees were working on related but distinct problems. A few new ideas were provided by the numerical analysts and a few by the computer scientists. Mostly, John learned that the numerical analysts were not able to help with statistical multiple integration problems. This result has been important in directing his research since then, and it only could have emerged through a meeting like the one in Arcata. The wheel-reinvention problem, discussed by some of his panel colleagues, is a very real one. Meetings with a specific focus to bring together well-chosen groups in economics and other areas (computer science, mathematics or other fields) would be very productive.

J. Postdoctoral Programs

It is not common to have postdoctoral programs in economics - perhaps in part because economics is not a laboratory science. However, computational economics has some of the flavor of a laboratory science and we suggest that some postdoctoral programs be initiated by NSF on an experimental basis. A subset of these fellowships might be set aside to take advantage of an important opportunity that we perceive. Thirty years ago most of the Ph.D. programs in economics were in the developed countries. That situation has now changed and many first-rate Ph.D.s are now being trained in the developing countries. Those students could benefit substantially from an opportunity to work as a postdoctoral student for a few years in a U.S. university where they could become acquainted with some computer hardware and software which may not be available in their home country.

In summary then, we suggest three actions on postdoctoral programs. First, principal investigators should be encouraged to include postdocs in their proposals where that is appropriate. Second, there should be a national competition for postdoctoral awards the way there is now a program for doctoral students. Third, a portion of this last program should be reserved for students who have received their Ph.D. degrees in the developing countries.

K. Continuing Education

In addition to being able to push the frontiers of research in computational economics, the computational economics initiative can go a long way in helping speed up the education of the professoriate. Only a small percentage of researchers are knowledgeable about the most up-to-date equipment, software, techniques, etc. Given the large investment of time that is needed simply to gather information about how to change the configuration of an individual's or department's computer usage, educational activities could significantly reduce the search costs to the profession. Specific suggestions include short courses, workshops, lectures and newsletters. For example, at the annual American Economic Association (AEA) meetings and at one or more of the more widely attended regional meetings (such as the Western or Southern) there could be one session on "recent developments" in hardware, another in software, another on new algorithms, etc. Alternatively, there could be a short course with a nominal charge on each of these topics, or a one-day course encompassing all three. It would also be a great benefit to the profession if the AEA could be persuaded to begin a newsletter (to be published perhaps twice a year) identifying new developments in these areas. There is already an AEA committee on Economic Education, so perhaps it could get involved or another committee could be formed.

There has been much discussion of the problem of "wheel re-invention" and NSF could certainly play a role here. One common complaint is that there is no mechanism for sharing code and algorithms or at least getting information out to the community about small- or large-scale programs. Part of this is wrapped up in the incentive problem, as well. If it takes a researcher days or weeks to get a program written and modified and running, there are no incentives to share with colleagues. Although a journal, such as the new Computer Science in Economics and Management journal, can help by publishing articles on software, smaller scale and more quickly disseminated methods are also needed. Perhaps a bulletin board arrangement of some kind could be set up and every time a program is used or requested it would be counted as a citation. Although it clearly would not "count" as much as a publication, it would be appropriate for smaller scale problems such as writing Gauss code or other things along those lines. NSF could help develop and underwrite such a project which could eventually be taken over by the AEA.

Another way that NSF funds could help in this area would be by sponsoring institute-like activities that bring together computer scientists and econometricians, or theoretical and applied econometricians. Such a program could be modeled after the new NSF initiative for institute-like activities to bring together statisticians and econometricians, which is being jointly funded by the Economics Program and the Probability and Statistics Program.

L. Digital Journals

Access to scientific information will be made easier and more precise as refereed journals become digital in form. The vision is to provide direct access over the NREN to journal articles, or portions of them, commentaries on them and pointers to related work. There are about a dozen refereed journals on the network now. Most of them are not in scientific fields. But that will change. We are not sure the existing ones fit our model of a digital journal. What is that model? First, it is not a formal collection of articles bound into volumes and owned by a publishing house. That is a model dictated by the economics and organization of the print technology. With the ability to store bits instead of pages, and to conduct intelligent searches over massive amounts of data located on distributed databases, a new model evolves. It allows the scientist to reference subjects, authors, data sets, and multi-media material including images. Parameters of such searches are flexible and more finely tuned than those we use to search printed articles (date, journal and author). For example, retrieval should be possible on qualitative factors such as how well an article was received by peer reviewers, or the perceived excellence of the "journal." It should also be possible to retrieve portions of articles and data sets, and portions of related articles, down to a fine level of detail.
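
The kind of flexible retrieval described above can be sketched in a few lines of Python. This is an illustration only: the Article record, the numeric reviewer_score field (a stand-in for "how well received by peer reviewers"), the search function, and the sample entries are all hypothetical assumptions, not features of any existing network journal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one article in a digital journal."""
    title: str
    author: str
    subjects: frozenset
    reviewer_score: float  # assumed 0-10 rating assigned by peer reviewers


def search(articles, subject=None, author=None, min_reviewer_score=0.0):
    """Return matching articles, best-reviewed first.

    Each criterion is optional, so a query can combine subject, author,
    and a qualitative threshold rather than only date, journal and author.
    """
    results = []
    for article in articles:
        if subject is not None and subject not in article.subjects:
            continue
        if author is not None and article.author != author:
            continue
        if article.reviewer_score < min_reviewer_score:
            continue
        results.append(article)
    return sorted(results, key=lambda a: a.reviewer_score, reverse=True)


# Illustrative data with made-up titles, authors and scores.
library = [
    Article("Paper A", "Author X", frozenset({"cointegration", "time series"}), 8.5),
    Article("Paper B", "Author Y", frozenset({"cointegration"}), 6.0),
    Article("Paper C", "Author X", frozenset({"general equilibrium"}), 9.0),
]
hits = search(library, subject="cointegration", min_reviewer_score=7.0)
hits_by_author = search(library, author="Author X")
```

A real system would run such queries against distributed databases rather than an in-memory list, which is where the research problems listed next come in.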

Four kinds of developments are needed before the vision can become reality.

First, the network architecture and capacity to handle high-volume, high-density file search and retrieval have to be designed. This is well under way in the NSFNET research program.

Second, the design of large-scale, heterogeneous, distributed databases has to be accomplished. This is a basic research challenge on which work is just beginning. Currently, NSF has a Scientific Database initiative under way, and there are also initiatives in the biology and geosciences sections of NSF.

Third, basic research is needed on how to design intelligent agents on the network that can search across many distributed databases for the information needed. Funding of this research is also relatively new in the Computer and Information Sciences and Engineering Directorate of NSF.

Fourth, we need to design the institutions and incentives that will encourage an industry (or its economic equivalent) of digital journals. This is a challenge for economists interested in the design of institutions and mechanisms as well as for lawyers who are concerned with intellectual property issues. Academics, generally, need to be concerned with the implications of digital journals for the system of priority and the determinants of tenure decisions. It may not be too early for interested economics journals or new entrants to propose experiments on these problems.

