1 August, 2003
Kees Mandemakers and Lisa Dillon
Best practices with large databases on historical populations
Since the late 1960s researchers who transform routinely-generated primary sources into machine-readable data have produced numerous methodological articles and book chapters detailing the process of data creation and describing how the peculiarities of primary sources can affect interpretation of the data. However, in such articles, the best practices for creating large databases have usually been implicit rather than explicit. Here a comprehensive list of best practices for the creation of large databases on historical populations is introduced, drawing upon the experiences of the Historical Sample of the Netherlands, the Canadian Families Project, the IPUMS and other projects.
The following guidelines represent a revised version of the protocol formulated on the occasion of the workshop of the Historical Sample of the Netherlands (HSN) which took place under the title ‘HSN-workshop on large databases: Results and best practices’ at Amsterdam on 17-18 May 2001. We thank all the participants for their comments on the earlier version, which have been incorporated as far as possible into these guidelines.
This protocol is not exhaustive, but it does address those principles which should be considered by all projects. We use the phrase ‘should be considered’ because we are well aware that other factors such as cost-benefit considerations may make it useful or necessary to make choices other than those advocated by this protocol. In general, diverging from one or more of the recommendations described below is not problematic as long as the argument for these choices is made clear to the users and future owners of the database. Thus, adherence to this protocol does not mean that every principle must be applied, yet the reasons for not following a specific rule or recommendation should be explained.
We consider here only those large databases which are intended to be open for secondary analysis and which are essentially open systems, to which data may be periodically added.
The purpose of maintaining a set of best practices for creating large databases on historical populations is to:
1. articulate the standards necessary to create and maintain high-quality databases and database documentation. Appropriate funding must be reserved to achieve these standards.
2. ensure that the historical community can trust the results of research based on these data.
3. benefit from previous experience in creating large databases. Thus, these rules can also be applied to data created by non-academic organizations such as genealogical societies or the Church of Jesus Christ of Latter-day Saints.
4. ensure databases are of sufficient quality for use by secondary researchers (researchers outside the main database creation team).
For the sake of clarity and argument and to stimulate further discussion, this list of best practices is written in the form of a specific set of rules. However, since every organization bears ultimate responsibility for its own work, this protocol should be read as a specific set of recommendations. Practical examples have been supplied in those instances in which we felt they would help illuminate a particular recommendation; however, to keep this set of rules lean, we have added examples sparingly.
Finally, we wish to stress that we are conscious that these recommendations are based on technology and expertise which will evolve with time. Rather than attempt to forecast future developments, we recommend that this protocol be revisited and periodically revised in keeping with new developments and insights. We invite readers to contribute to periodic revisions of this protocol by forwarding comments directly to the authors (Kees Mandemakers, firstname.lastname@example.org and Lisa Dillon, email@example.com) or by addressing public comments to the H-DEMOG listserv.
Best Practices with large databases on historical populations
List of recommendations for the practice of working with large databases on historical populations.
We define large databases as databases which are designed to fulfill one or more of the following purposes: serve more than one goal and/or researcher; support secondary analysis; and enable extension of the database through the addition of other material or the linking of the database to other databases.
During the data creation process, from data entry to data release, a large number of decisions are made which influence the quality of the final product. Potential user groups should be consulted on all important points of database planning. Three major stages are distinguished in this process:
A. Definition of the objective and content of the database, selection of sources and sample criteria.
B. Data entry, integration, standardization and storage
C. Enrichment and release of data
A website should be established from the inception of the database. All documentation must be made available online. This website should include among its elements an overview of all developed projects. All decisions which will be made with respect to the following list of best practices should also be published on the website of the institutions which underwrite this protocol.
A. Universe of the database, selection of sources and sample criteria.
A1 The universe of the database—in other words, the goal and content of the database considered together—should be defined very clearly. If a particular source, such as a census, is chosen to form the core of the database, the rationale behind this choice should be documented.
A2 The type of primary source which will be transformed into machine-readable format will be primarily dependent on the universe of the database and the availability of sources. In principle all available information recorded in the source should be included in the database. Database creators may choose to omit certain variables in order to save data entry time and/or money. However, such choices should be weighed against the possible future cost of restoring omitted variables. The motivation for choosing each source should be documented extensively.
A3 The documentation of each source should include at least the following elements:
a the genesis and historical context of the source (how did this source function in its administrative and social contexts)
b an image of each used source and other ancillary material which may be of use to the user, such as enumerator instructions
c an estimation of the reliability of the source
d relation of the different primary sources included in this database to each other and to other primary sources which are not yet included in the database but which are relevant to the content of the database. Examples include the relationship of household registers to tax registers (taxpayers in the household are identified and appropriate linkage procedures are described) and a short description of sources which can be linked to the database in the future
e international comparability of the source
f the way the source should and possibly can be interpreted (at the variable level)
g the rules which are used in the process of transcription and estimates of transcription and coding errors
h the conditions under which the data are entered into the machine
i extensive description of all original and constructed variables which describes the information that can be extracted from the primary sources and specifies the universe of each variable
A4 If the database or parts of it consist of a sample, the following rules and recommendations shall be applied:
a A stratified sample design is recommended to enhance sample precision for characteristics such as region or period and to facilitate selective oversampling.
b Within the chosen strata, the random sampling method is to be preferred to systematic sampling on the basis of risky criteria such as specific letters or every nth person, a sample design which can result in a heavy bias on unrelated persons. In some instances, database creators may choose to combine random sampling methods with cluster sampling, in order to take advantage of the presence of household groupings in the primary source.
c The design and execution of each sample should be carefully documented.
d Whether the database is based on a complete population or a sample, it is recommended that protocols be developed to clarify how subsequent samples can be designed and drawn. For example, when a machine-readable sample of individuals includes additional individuals on the basis of household relationships, database creators should evaluate and describe the representativity of these households. Conversely, database creators who have decided to designate the household as the sample point should describe the representativity of individuals in the household. Such protocols are especially important when database creators develop longitudinal databases and link to the central database new sources which can only be partly processed. Additional sources which can be linked to the database include deeds, tax records, land registers and parish registers.
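Recommendations A4a to A4c can be illustrated with a minimal sketch of a stratified random sample. The field name `region`, the sampling fractions and the seed are assumptions made for the example only; the point is that the draw is random within each stratum, allows oversampling, and can be documented and repeated.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, fractions, seed):
    """Draw a simple random sample within each stratum.

    fractions maps each stratum value to its sampling fraction, so that
    selected strata can be oversampled. The seed is part of the documentation:
    with the same input and seed the same sample is drawn again.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)

    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        n = round(len(members) * fractions[stratum])
        for rec in rng.sample(members, n):
            # The weight (inverse sampling fraction) lets users undo oversampling.
            sample.append({**rec, "weight": len(members) / n})
    return sample

# Hypothetical population: 1000 urban and 100 rural records.
population = (
    [{"id": i, "region": "urban"} for i in range(1000)]
    + [{"id": 1000 + i, "region": "rural"} for i in range(100)]
)
# Rural records are oversampled (50%) relative to urban ones (10%).
sample = stratified_sample(
    population, "region", {"urban": 0.10, "rural": 0.50}, seed=2003
)
```

Storing the weight with each sampled record is one possible way of documenting the design at the level of the data; a project may equally record it in a separate table.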
B. Data entry, integration, standardization and storage.
From a logical point of view, the database consists of at least three different components: SOURCES, CENTRAL DATABASE and DATA RELEASES. The division of the database into these three components should be followed rigorously. Of course, many different versions of the data will be produced in each component and subsequently archived, discarded or overwritten at various stages of the project. However, the three components discussed here represent important distinct logical parts of the database.
SOURCES: In the course of data entry, the source should be replicated literally, with no form of standardization taking place during data entry. Data entry should proceed so as to ensure minimal data loss. The data entry process should preclude the necessity of further inspection of the original primary source. Literal transcription of the sources also allows different researchers to encode variables as they see fit. Even seemingly straightforward interpretations of references such as “ditto” should be made not at the stage of data entry but at a later stage through automatic and repeatable transformations. To facilitate the data entry process, however, abbreviations can be used and variations of “ditto” can be standardized in one word. In addition, it can be crucial to add interpretations of ambiguous responses in square brackets. For example, a widowed head of household whose relationship to the household head is given as “widow” may be transcribed in the database as “widow [head]”. A sampled head of household whose occupation is given as “ditto” can be recorded in the database as “ditto [farmer].” All forms of interpretation must be guided by a protocol which directs the data entry process.
CENTRAL DATABASE: Data enhancement, namely the standardization of values and integration of different data sources, will occur in a second phase in which the central database is created.
DATA RELEASES: Data releases are generated from the central database. A data release may consist of the data as they exist in the original source or in the central database without any change or addition.
This logical partition of database development into three components is essential. By keeping the SOURCE data, CENTRAL DATABASE and DATA RELEASES separate, procedures for standardization, interpretation and integration can always be repeated and errors can be rectified. In addition, the rules of these procedures can be changed without jeopardising the integrity of the database. Each component (or parts of these components) can be produced or revised in different stages independently of each other. Keeping these three data components separate will also facilitate the adoption of legislation on data protection.
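The separation between literal SOURCES and an interpreted CENTRAL DATABASE can be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: the row layout, the function name `resolve` and the rule that a bracketed interpretation takes precedence over a bare "ditto" are all hypothetical.

```python
import re

# SOURCES component: literal transcription, never modified by the transformation.
source_rows = [
    {"line": 1, "occupation": "farmer"},
    {"line": 2, "occupation": "ditto"},
    {"line": 3, "occupation": "ditto [labourer]"},
    {"line": 4, "occupation": "weaver"},
]

# Matches a literal value followed by an interpretation in square brackets.
BRACKET = re.compile(r"^(?P<literal>.*?)\s*\[(?P<interp>[^\]]+)\]$")

def resolve(rows, field):
    """Build CENTRAL DATABASE values from literal source values.

    A bracketed interpretation entered by the transcriber takes precedence;
    a bare "ditto" repeats the resolved value of the preceding line.
    """
    resolved, previous = [], None
    for row in rows:
        literal = row[field]
        match = BRACKET.match(literal)
        if match:
            value = match.group("interp")
        elif literal.strip().lower() == "ditto":
            value = previous
        else:
            value = literal
        resolved.append(
            {"line": row["line"], field + "_source": literal, field: value}
        )
        previous = value
    return resolved

central_rows = resolve(source_rows, "occupation")
```

Because the transformation is a function of the source rows alone, it can be rerun whenever the rule for interpreting "ditto" changes, and the literal text remains available next to the resolved value.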
The following sections elaborate upon best practices in the creation of SOURCES, the CENTRAL DATABASE and DATA RELEASES. DATA RELEASES will also be addressed in section C.
Data transcribers and data entry operators who follow the basic principle of literal data transcription can face several challenges:
1 Data are difficult to read or are spelled incorrectly.
2 Data are misread by the transcriber or data-entry operator.
3 Data have been entered correctly, but the source itself is wrong. Errors in the source can result from:
3a a mistake of the clerk then in charge
3b a mistake the clerk could not avoid because he was supplied with incorrect information. Correct and incorrect responses in the original source can be difficult to distinguish. For example, did witnesses themselves know how their name should be written by the official?
3c the fact that the official rules which governed the lay-out and content of routinely-generated sources such as civil acts and population registers were not always available when these sources were originally created. In addition, the persons who created these sources, such as clerks or census respondents, did not always follow rules in the same manner.
4 Characteristics of particular individuals often vary from source to source. For example, a person’s date of birth or the spelling of their last name can change from one document to another. These discrepancies result from:
4a mistakes of the type described in points 1 to 3.
4b changes which take place as the individual grows older. Some of these transitions, such as a name change at marriage, are official while others, such as changing last names to more socially desirable ones, are informally initiated and therefore unofficial.
4c inconsistent responses by different people to the same question, since answers can depend upon perceptions of the data collection exercise.
Given these challenges with literal data transcription, the following rules should be honoured as data from different sources are standardized and integrated:
B1 It must always be possible to make corrections in the source component(s) of the database. A database continuously evolves as further data are linked to it. During this process it should be possible to resolve ambiguous values by correcting one or more of the source components. These procedures should include the correction of mistakes, such as event date errors, discovered when linking additional data.
B2 The processes of standardization and integration should be iterative. These repeating procedures are only made possible by preserving a clear distinction between the literal transcription of the data in the SOURCE and the standardized version of the data in the CENTRAL DATABASE. Changes in the source components or in tables with standard values can automatically or semi-automatically lead to changes in the contents of the central database. This distinction also makes it possible to systematize the release of different versions of the data. Database creators may choose to release different versions of the data not only because standards or procedures in resolving data inconsistencies evolve over time but also because the way data are standardized or integrated may be dependent upon the specific goal of a particular research project or publication.
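The iterative character of B2 can be sketched with a table of standard values that is applied to unchanged source rows. The spellings, the field name `birthplace` and the function `build_central` are hypothetical and serve only to show that extending the table and rebuilding replaces any manual editing of the central database.

```python
def build_central(source_rows, standard_values):
    """Derive the central table from literal source rows and a lookup table.

    Literal values missing from the lookup table stay unstandardized (None)
    so that they can be reviewed and added to the table in a later iteration.
    """
    return [
        {
            "id": row["id"],
            "birthplace_source": row["birthplace"],
            "birthplace_std": standard_values.get(row["birthplace"].strip().lower()),
        }
        for row in source_rows
    ]

sources = [
    {"id": 1, "birthplace": "A'dam"},
    {"id": 2, "birthplace": "Amsterdam"},
    {"id": 3, "birthplace": "Amsteldam"},
]

# First iteration: the table of standard values is still incomplete.
standards_v1 = {"amsterdam": "Amsterdam", "a'dam": "Amsterdam"}
central_v1 = build_central(sources, standards_v1)

# Second iteration: the table is extended and the central table is rebuilt
# from the unchanged sources instead of being edited by hand.
standards_v2 = {**standards_v1, "amsteldam": "Amsterdam"}
central_v2 = build_central(sources, standards_v2)
```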
B3 The rules by which data from different sources are integrated should be documented. This recommendation applies to datasets which link information about individuals and family members to construct unique biographies and reconstitute families; it also applies to all other linked information such as contextual economic or geographic information.
B4 To standardize the spelling of values drawn from different sources, the following conventions are recommended. Spelling standards in the CENTRAL DATABASE and DATA RELEASES will be set primarily by data drawn from higher quality primary sources, such as vital acts, and secondly from lower-quality sources such as population registers or tax records. In the case of spelling inconsistencies between sources of the same quality, the standard could be set by the most recent source; for example, spelling used in death certificate data could prevail over spelling used in birth certificate data.
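As a sketch of the convention in B4, the standard spelling can be chosen by ranking observations first on source quality and then on recency. The quality ranking, the source type labels and the surname spellings are assumptions for illustration.

```python
from datetime import date

# Hypothetical quality ranking: a lower number means a higher-quality source type.
SOURCE_QUALITY = {"vital_act": 1, "population_register": 2, "tax_record": 2}

def standard_spelling(observations):
    """Pick the spelling from the highest-quality source; break ties by recency."""
    best = min(
        observations,
        key=lambda o: (SOURCE_QUALITY[o["source_type"]], -o["date"].toordinal()),
    )
    return best["spelling"]

observations = [
    {"spelling": "Janssen", "source_type": "population_register", "date": date(1890, 1, 1)},
    {"spelling": "Jansen", "source_type": "vital_act", "date": date(1850, 3, 2)},  # birth certificate
    {"spelling": "Janse", "source_type": "vital_act", "date": date(1921, 7, 9)},   # death certificate
]
chosen = standard_spelling(observations)
```

Here the two vital acts outrank the population register, and the death certificate prevails over the birth certificate because it is the more recent of the two.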
B5 The decisions by which database creators make case-by-case, manual corrections of mistakes in the original source (errors or discrepancies of the nature discussed in points 1 and 3a) should be clearly documented in the SOURCES component itself by preserving the clearly marked original text or figure. Preserving the original text enables researchers to make a different interpretation. Data transcription forms and data-entry software should include comment fields so that transcribers and data-entry operators have the opportunity to clarify their manual corrections. All manual corrections are undertaken in the SOURCES files; in the CENTRAL DATABASE only automatic corrections are made by applying rules designed to correct source-based mistakes.
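One possible way to honour B5 in data-entry software is to store every manually corrected value together with the untouched original and a mandatory comment. The class below is a hypothetical sketch; its name, its fields and the example correction are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceValue:
    original: str                    # literal transcription, never overwritten
    corrected: Optional[str] = None  # latest manual correction, if any
    comments: List[str] = field(default_factory=list)

    def correct(self, new_value, comment):
        """Record a manual correction together with the reason for it."""
        if not comment:
            raise ValueError("a manual correction requires an explanatory comment")
        self.corrected = new_value
        self.comments.append(comment)

    @property
    def value(self):
        """Value passed on to the central database."""
        return self.corrected if self.corrected is not None else self.original

# Hypothetical example: an impossible date corrected by the transcriber.
birth_date = SourceValue(original="31-02-1854")
birth_date.correct("21-02-1854", "31 February does not exist; read as 21 February")
```

Since `original` is never overwritten, a researcher who disagrees with the correction can still fall back on the literal text.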
B6 Data which are inferred by means of logic or estimation should be stored in a different variable or identified by a flag variable. The manner in which such inferences are made should be recorded. This documentation is especially important in the case of dating the data for particular research methods such as event history analysis. While in many cases, the primary source does not provide a specific date, it is often possible to make an educated guess. Like other kinds of inferred data, estimations of dates should be standardized. If different rules of inference are applied to the same variable, an additional variable is needed which describes the rules by which the dates were estimated or inferred.
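The flagging of inferred dates described in B6 might look as follows. The rule codes, the choice of 1 July as the assumed day within a year and the field names are illustrative assumptions, not values prescribed by this protocol.

```python
from datetime import date

# Hypothetical codes describing how a date was obtained.
DATE_RULES = {
    0: "date given in the source",
    1: "only the year is given; 1 July assumed",
    2: "year estimated from age stated at a dated event; 1 July assumed",
}

def birth_date_with_flag(record):
    """Return (date, rule_code); the code documents any inference made."""
    if record.get("birth_date"):
        return record["birth_date"], 0
    if record.get("birth_year"):
        return date(record["birth_year"], 7, 1), 1
    if record.get("age") is not None and record.get("event_date"):
        event = record["event_date"]
        return date(event.year - record["age"], 7, 1), 2
    return None, None

exact = birth_date_with_flag({"birth_date": date(1850, 3, 2)})
from_year = birth_date_with_flag({"birth_year": 1850})
from_age = birth_date_with_flag({"age": 30, "event_date": date(1880, 5, 10)})
```

Keeping the rule code in its own variable allows users of event history methods to exclude or reweight estimated dates.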
B7 All electronic material relating to the database should be stored, preserved and made obtainable in the simplest format possible whether or not it is also released in proprietary software formats such as Microsoft Access, SPSS and SAS. The choice of character set should be documented.
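As an illustration of B7, a table can be written as delimited plain text in an explicitly named character set. The choice of comma-separated values and UTF-8 is an assumption made for this sketch; what matters is that the format is simple and the encoding is documented.

```python
import csv
import io

ENCODING = "utf-8"  # assumption: the character set documented for the release

def to_plain_text(rows, fieldnames):
    """Serialize rows as comma-separated text in the documented encoding."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode(ENCODING)

# Hypothetical rows with a non-ASCII character and an embedded comma.
rows = [{"id": 1, "surname": "Müller"}, {"id": 2, "surname": "De Vries, jr."}]
payload = to_plain_text(rows, ["id", "surname"])

# Reading the bytes back with the documented encoding restores the text.
restored = list(csv.DictReader(io.StringIO(payload.decode(ENCODING))))
```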
B8 To prevent data losses, back-ups of working files must be made at regular intervals. In addition, a copy of the whole dataset should be preserved in a disaster-proof environment, namely a data archive such as the Inter-University Consortium for Political and Social Research (ICPSR) or the UK Data Archive.
C. Enrichment and release of data.
When disseminating the data, the following procedures should be honoured:
C1 Rules laid down by national and supranational data protection acts must be respected. The rules which concern the database in question should be made public, for example, by publishing the constitution under which the data were gathered and released.
C2 Releases of all or part of the data-set should be accompanied by, at minimum, the release name and release date. If the release name is not unique, a serial number must be added as well. These release features must be included in all tables and files which comprise the release. The correct manner for citing the dataset in journals should also be made explicit. New versions of existing releases should highlight the changes which have been made since the previous one.
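The requirement in C2 that release features appear in all tables of a release can be sketched as a stamping step applied when the release is generated. The column names, the release name and the table layout are hypothetical.

```python
from datetime import date

def stamp_release(tables, name, released, serial):
    """Return copies of all tables with release metadata added to every row."""
    stamp = {
        "release_name": name,
        "release_date": released.isoformat(),
        "release_serial": serial,
    }
    return {
        table_name: [{**row, **stamp} for row in rows]
        for table_name, rows in tables.items()
    }

# Hypothetical tables drawn from the central database.
tables = {
    "persons": [{"person_id": 1}, {"person_id": 2}],
    "marriages": [{"marriage_id": 10}],
}
release = stamp_release(tables, "HYPOTHETICAL_RELEASE", date(2003, 8, 1), serial=2)
```

The stamped copies form the release; the tables in the central database themselves are left unchanged.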
C3 Provision should be made for broad public access to the data, either at no cost or limited to handling charges. In order to comply with confidentiality laws it may be necessary to provide the data in an anonymized format, in which case the documentation should state that information has been removed from this release of the data and describe which information has been removed. Anonymization of data is particularly important in the case of databases which include information about the causes of death or which cover the complete life course of individuals and families who span both historical and contemporary periods. In such instances, it is recommended to anonymize all cases less than one hundred years old, even if there are no legal objections against full publication.
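The hundred-year rule in C3 could be applied at release time along the following lines. This sketch assumes that the age of a case is measured from its most recent recorded event, that the listed field names are the sensitive ones, and that masked values receive the privacy code -3 recommended in C6; each of these is an assumption a project would have to settle for itself.

```python
from datetime import date

SENSITIVE_FIELDS = ("surname", "first_name", "cause_of_death")  # hypothetical names
PRIVACY_CODE = -3  # C6: not available for reasons of privacy but existing in the database

def full_years_between(earlier, later):
    """Number of completed years from earlier to later."""
    years = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years

def anonymize(record, release_date, threshold_years=100):
    """Return a copy in which sensitive fields of recent cases are masked."""
    age = full_years_between(record["last_event_date"], release_date)
    if age >= threshold_years:
        return dict(record)
    masked = {f: PRIVACY_CODE for f in SENSITIVE_FIELDS if f in record}
    return {**record, **masked}

release_date = date(2003, 8, 1)
old_case = {"surname": "Jansen", "cause_of_death": "typhus",
            "last_event_date": date(1880, 1, 1)}
recent_case = {"surname": "Jansen", "cause_of_death": "typhus",
               "last_event_date": date(1950, 6, 1)}
```

The function returns a new record, so the central database keeps the full information while only the release is anonymized.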
C4 Data may be distributed freely or through specified contracts with individual users. When the database is not freely distributed, conditions on the use of data will be regulated by contract. If the released dataset is freely distributed by the institution, users are not allowed to a) disseminate altered versions of the data or make additions without consulting the administrators of the database, or b) charge fees for use or distribution, and users should c) cite the database according to the conditions of use and d) send a copy of all publications, research reports or educational material based on the dataset to the owners of the license.
C5 Variables should be standardized as much as possible without losing content. Standardization should be effected in such a way that original values are retained and new ones are kept in separate variables or tables. Wherever possible releases will include the internationally standardized versions of at least certain complex variables, such as occupation and birthplace.
C6 The reasons for missing data should be explained. Numbers such as “0” and “99”, which are meaningful in certain contexts such as age, should not be used to indicate missing information. Instead, negative codes should be used. The following four codes are recommended as standard values for lacking information:
-1 Not available in the source
-2 Not readable in the source
-3 Not available for reasons of privacy but existing in the database
-4 For reasons of privacy not taken into the database
Other negative codes can be used for reasons specified in the documentation.
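The four recommended codes can be represented as named constants, which makes it harder to confuse missing information with meaningful values such as an age of 0 or 99. The constant names and the helper functions are hypothetical; only the numeric codes come from the list above.

```python
from enum import IntEnum

class Missing(IntEnum):
    NOT_IN_SOURCE = -1        # not available in the source
    NOT_READABLE = -2         # not readable in the source
    PRIVACY_WITHHELD = -3     # exists in the database, withheld for privacy
    PRIVACY_NOT_ENTERED = -4  # not taken into the database for privacy

def is_missing(value):
    """Any negative code denotes missing information; 0 and 99 are valid ages."""
    return isinstance(value, int) and value < 0

def mean_age(ages):
    """Average over observed ages only, excluding the negative missing codes."""
    observed = [a for a in ages if not is_missing(a)]
    return sum(observed) / len(observed) if observed else None

ages = [0, 34, Missing.NOT_READABLE, 99, Missing.NOT_IN_SOURCE]
average = mean_age(ages)
```

Testing for any negative value, rather than for the four listed codes only, also covers the additional negative codes a project may define in its documentation.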
C7 All geographic information should be suitably geo-referenced, to make it possible to relate the data to Geographical Information Systems.
C8 To assist researchers unfamiliar with research designs which feature statistical tests, releases should refer to model studies which employ the dataset or data similar to it.
C9 To encourage researchers unfamiliar with large databases, ‘easy user versions’ of existing releases should also be produced.
C10 The Data Documentation Initiative (DDI) developed by the ICPSR and participating institutions such as Networked Social Science Tools and Resources (NESSTAR) should be adopted to guide the development of metadata, or online documentation. The use of DDI will ensure that metadata has been produced “in a uniform, highly structured format that is easily and precisely searchable on the Web, that lends itself well to simultaneous use of multiple datasets, and that will significantly improve the content and usability of metadata.”
C11 Besides the already mentioned documentation in A3 & B3 the website of the database should include the following elements:
a overview of all developed projects
b decisions with respect to this protocol
c international comparability of the database
d explanation of the database design
e if the database is a sample, explanation of the sample design and provision of standard errors for major variables
f list of available variables and values
g variable descriptions, including the universe for each variable and frequency distributions for coded variables
h information on variable compatibility across time and space
i information about database construction and software
j conditions on the use of the data
k revision history of the database
l information about proper citation of the database
m bibliography of model studies
n contact information
o acknowledgement of all participants in the creation of the database and of funding sources
C12 To stimulate interest among potential users, information about the progress of the database should be published on the website at regular intervals. Such reports can consist of updates on the number of cases entered into the database or the extent to which data have been coded and prepared for release. Database documentation must be published on the website beginning with the first release of the database.