1 August, 2003
Kees Mandemakers and Lisa Dillon
Best practices with large databases on historical populations
Since the late 1960s researchers who transform routinely-generated primary
sources into machine-readable data have produced numerous methodological
articles and book chapters detailing the process of data creation and describing
how the peculiarities of primary sources can affect interpretation of the
data. However, in such articles, the best practices for creating large
databases have usually been implicit rather than explicit. Here a comprehensive
list of best practices for the creation of large databases on historical
populations is introduced, drawing upon the experiences of the Historical
Sample of the Netherlands, the Canadian Families Project, the IPUMS and other
projects.
The following guidelines represent a revised version of the protocol formulated
on the occasion of the workshop of the Historical Sample of the Netherlands
(HSN) which took place under the title ‘HSN-workshop on large databases:
Results and best practices’ in Amsterdam on 17-18 May 2001. All the
participants are to be thanked for their comments on this earlier version
which have been incorporated as much as possible into these guidelines.
This protocol is not exhaustive, but it does address those principles which
should be considered by all projects. We use the phrase ‘should be considered’
because we are well aware that other factors such as cost-benefit considerations
may make it useful or necessary to make choices other than those advocated
by this protocol. In general, diverging from one or more of the recommendations
described below is not problematic as long as the argument for these choices
is made clear to the users and future owners of the database. Thus, adherence
to this protocol does not mean that every principle must be applied, yet
the reasons for not following a specific rule or recommendation should be
explained.
We consider here only those large databases which are intended to be open
for secondary analysis and which are essentially open systems, to which data
may be periodically added.
The purpose of maintaining a set of best practices for creating large databases
on historical populations is to:
1. articulate the standards necessary to create and maintain
high-quality databases and database documentation. Appropriate funding
must be reserved to achieve these standards.
2. ensure that the historical community can trust the results
of research based on these data.
3. benefit from previous experience in creating large databases.
Thus, these rules can also be applied to data created by non-academic organizations
such as genealogical societies or the Church of Jesus Christ of Latter-day
Saints.
4. ensure databases are of sufficient quality for use by
secondary researchers (researchers outside the main database creation team).
For the sake of clarity and argument and to stimulate further discussion,
this list of best practices is written in the form of a specific set of rules.
However, since every organization bears ultimate responsibility for its own
work, this protocol should be read as a specific set of recommendations.
Practical examples have been supplied in those instances in which we felt
they would help illuminate a particular recommendation; however, to keep this
set of rules lean, we have added examples sparingly.
Finally, we wish to stress that we are conscious that these recommendations
are based on technology and expertise which will evolve with time.
Rather than attempt to forecast future developments, we recommend that this
protocol be revisited and periodically revised in keeping with new developments
and insights. We invite readers to contribute to periodic revisions
of this protocol by forwarding comments directly to the authors (Kees Mandemakers,
kma@iisg.nl and Lisa Dillon, ly.dillon@umontreal.ca) or by addressing public
comments to the H-DEMOG listserv.
Protocol
Best Practices with large databases on historical populations
List of recommendations for the practice of working with large databases
on historical populations.
We define large databases as databases which are designed to fulfill one
or all of the following purposes: serve more than one goal and/or researcher;
support secondary analysis; and enable extension of the database through
the addition of other material or the linking of the database to other databases.
During the data creation process, from data entry to data release, a large
number of decisions are made which influence the quality of the final product.
Potential user groups should be consulted on all important points of database
planning. Three major stages are distinguished in this process:
A. Definition of the objective and content of the database,
selection of sources and sample criteria.
B. Data entry, integration, standardization and storage
C. Enrichment and release of data
A website should be established from the inception of the database. All documentation
must be made available online. This website should include among its elements
an overview of all developed projects. All decisions which will be made with
respect to the following list of best practices should also be published
on the website of the institutions which underwrite this protocol.
A Universe of the database, selection of sources and
sample criteria.
A1 The universe of the database—in other words, the goal
and content of the database considered together—should be defined very clearly.
If a particular source, such as a census, is chosen to form the core of the
database, the rationale behind this choice should be documented.
A2 The type of primary source which will be transformed
into machine-readable format will be primarily dependent on the universe
of the database and the availability of sources. In principle all available
information recorded in the source should be included in the database. Database
creators may choose to omit certain variables in order to save data entry
time and/or money. However, such choices should be weighed against the
possible future cost of restoring omitted variables. The motivation for choosing
each source should be documented extensively.
A3 The documentation of each source should include
at least the following elements:
a the genesis and historical context of the source (how
did this source function in its administrative and social contexts)
b an image of each source used and other ancillary material
which may be of use to the user, such as enumerator instructions
c an estimation of the reliability of the source
d relation of the different primary sources included in
this database to each other and to other primary sources which are not yet
included in the database but which are relevant to the content of the database.
Examples include the relationship of household registers to tax registers
(taxpayers in the household are identified and appropriate linkage procedures
are described) and a short description of sources which can be linked to
the database in the future
e international comparability of the source
f the way the source should, and possibly can, be interpreted
(at the variable level)
g the rules which are used in the process of transcription
and estimates of transcription and coding errors
h the conditions under which the data are entered into
machine-readable form
i extensive description of all original and constructed
variables which describes the information that can be extracted from the
primary sources and specifies the universe of each variable
A4 If the database or parts of it consist of a sample,
the following rules and recommendations shall be applied:
a A stratified sample design is recommended to enhance
sample precision for characteristics such as region or period and to facilitate
selective oversampling.
b Within the chosen strata, random sampling
is to be preferred to systematic sampling on the basis of risky criteria,
such as specific letters of the alphabet or every nth person, a design which
can result in a heavy bias with respect to unrelated persons. In some instances,
database creators may choose to combine random sampling methods with cluster
sampling, in order to take advantage of the presence of household groupings
in the primary source.
c The design and execution of each sample should be carefully
documented.
d Whether the database is based on a complete population
or a sample, it is recommended that protocols be developed to clarify how
subsequent samples can be designed and drawn. For example, when a machine-readable
sample of individuals includes additional individuals on the basis of household
relationships, database creators should evaluate and describe the representativeness
of these households. Conversely, database creators who have decided
to designate the household as the sample point should describe the representativeness
of individuals in the household. Such protocols are especially important
when database creators develop longitudinal databases and link to the central
database new sources which can only be partly processed. Additional
sources which can be linked to the database include deeds, tax
records, land registers and parish registers.
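As an illustration of the stratified random design recommended above, the following Python sketch draws a fixed-fraction simple random sample within each stratum. The record layout, the strata (region and decade) and the sampling fraction are all hypothetical; a real project would substitute its own universe and strata.

```python
import random

def stratified_sample(records, strata_key, fraction, seed=2001):
    """Draw a simple random sample within each stratum.

    records:    list of dicts (hypothetical record layout)
    strata_key: function mapping a record to its stratum, e.g. (region, year)
    fraction:   sampling fraction applied within every stratum
    """
    rng = random.Random(seed)                  # fixed seed keeps the draw repeatable
    strata = {}
    for rec in records:
        strata.setdefault(strata_key(rec), []).append(rec)
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))  # random draw, not every nth record
    return sample

# Hypothetical universe: 50 records per (region, decade) stratum
records = [{"id": i, "region": region, "year": year}
           for i, (region, year) in enumerate(
               (r, y) for r in ("north", "south")
                      for y in (1850, 1860)
                      for _ in range(50))]
sample = stratified_sample(records, lambda r: (r["region"], r["year"]), 0.1)
```

Because the seed and the stratum definitions are recorded, the draw itself satisfies recommendation A4c: the sample can be documented and reproduced exactly.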
B Data entry, integration, standardization
and storage.
From a logical point of view, the database consists of at least three different
components: SOURCES, CENTRAL DATABASE and DATA RELEASES. The division of
the database into these three components should be followed rigorously.
Of course, many different versions of the data will be produced in each component
and subsequently archived, discarded or overwritten at various stages of
the project. However, the three components discussed here represent
important distinct logical parts of the database.
SOURCES: In the course of data entry, the source should be replicated
literally, with no form of standardization taking place during data entry.
Data entry should proceed so as to ensure minimal data loss. The data entry
process should preclude the necessity of further inspection of the original
primary source. Literal transcription of the sources also allows different
researchers to encode variables as they see fit. Even seemingly straightforward
interpretations of references such as “ditto” should be made not at the stage
of data entry but at a later stage through automatic and repeatable transformations.
To facilitate the data entry process, however, abbreviations can be used
and variations of “ditto” can be standardized to one word. In addition,
it can be crucial to add interpretations of ambiguous responses in
square brackets. For example, a widowed head of household whose relationship
to the household head is given as “widow” may be transcribed in the database
as “widow [head]”. A sampled head of household whose occupation is given
as “ditto” can be recorded in the database as “ditto [farmer].” All forms
of interpretation must be guided by a protocol which directs the data entry
process.
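The principle that references such as “ditto” are resolved only after data entry, through an automatic and repeatable transformation, might be sketched as follows. The field names are hypothetical; the essential point is that the literal transcription is preserved alongside the derived value, so the transformation can be re-run or revised at any time.

```python
def resolve_ditto(rows, field):
    """Replace 'ditto' entries by the last explicit value above them.

    The literal transcription is kept in `<field>_source`; the resolved
    value goes into a separate derived field, never overwriting the source.
    """
    last = None
    out = []
    for row in rows:
        row = dict(row)                      # never mutate the SOURCE data
        value = row[field]
        row[field + "_source"] = value       # keep the literal transcription
        if value.strip().lower() == "ditto":
            row[field + "_resolved"] = last
        else:
            row[field + "_resolved"] = value
            last = value
        out.append(row)
    return out

rows = [{"name": "Jansen", "occupation": "farmer"},
        {"name": "de Vries", "occupation": "ditto"}]
resolved = resolve_ditto(rows, "occupation")
# resolved[1]["occupation_resolved"] == "farmer"
```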
CENTRAL DATABASE: Data enhancement, namely the standardization of values
and integration of different data sources, will occur in a second phase in
which the central database is created.
DATA RELEASES: Data releases are generated from the central database.
A data release may consist of the data as they exist in the original source
or in the central database without any change or addition.
This logical partition of database development into three components is essential.
By keeping the SOURCE data, CENTRAL DATABASE and DATA RELEASES separate,
procedures for standardization, interpretation and integration can always
be repeated and errors can be rectified. In addition, the rules of
these procedures can be changed without jeopardising the integrity of the
database. Each component (or parts of these components) can be produced or
revised in different stages independently of each other. Keeping these three
data components separate will also facilitate the adoption of legislation
on data protection.
The following sections elaborate upon best practices in the creation of SOURCES,
the CENTRAL DATABASE and DATA RELEASES. DATA RELEASES will also be addressed
in section C.
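A minimal Python sketch of this three-component separation, with hypothetical table and rule names: the SOURCE records are never overwritten, the CENTRAL DATABASE is rebuilt from them by repeatable rules, and a DATA RELEASE is generated from the central data and stamped with a name and date (cf. recommendation C2).

```python
SOURCES = [  # literal transcription, never overwritten
    {"id": 1, "birthplace": "Amsterdam"},
    {"id": 2, "birthplace": "Amsterdm"},   # transcribed exactly as written
]

STANDARDS = {"Amsterdm": "Amsterdam"}      # editable rule table

def build_central(sources, standards):
    """Rebuild the CENTRAL DATABASE from scratch; the step is repeatable."""
    central = []
    for rec in sources:
        rec = dict(rec)                    # leave the SOURCE component intact
        rec["birthplace_std"] = standards.get(rec["birthplace"],
                                              rec["birthplace"])
        central.append(rec)
    return central

def make_release(central, name, date):
    """Generate a DATA RELEASE stamped with its name and date."""
    return {"release": name, "date": date, "records": central}

central = build_central(SOURCES, STANDARDS)
release = make_release(central, "HSN-sketch", "2003-08-01")
```

Because the central database is derived rather than edited in place, changing a rule in the standards table and rebuilding reproduces the correction everywhere, which is exactly the iterability required by recommendation B2 below.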
Data transcribers and data entry operators who follow the basic principle
of literal data transcription can face several challenges:
1 Data are difficult to read or are spelled incorrectly.
2 Data are misread by the transcriber or data-entry operator.
3 Data have been entered correctly, but the source itself
is wrong. Errors in the source can result from:
3a a mistake by the clerk in charge at the time
3b a mistake the clerk could not avoid because he was supplied
with incorrect information. Correct and incorrect responses in the original
source can be difficult to distinguish. For example, did witnesses
themselves know how their name should be written by the official?
3c the fact that the official rules which governed the
lay-out and content of routinely-generated sources such as civil acts and
population registers were not always available when these sources were originally
created. In addition, the persons who created these sources, such as
clerks or census respondents, did not always follow rules in the same manner.
4 Characteristics of particular individuals often vary
from source to source. For example, a person’s date of birth or the
spelling of their last name can change from one document to another. These
errors result from:
4a mistakes of the type described in points 1 to 3.
4b changes which take place as the individual grows older.
Some of these transitions, such as a name change at marriage, are official
while others, such as changing last names to more socially desirable ones,
are informally initiated and therefore unofficial.
4c inconsistent responses by different people to the same
question, since answers can depend upon perceptions of the data collection
exercise.
Given these challenges with literal data transcription, the following rules
should be honoured as data from different sources are standardized and integrated:
B1 It must always be possible to make corrections
in the source component(s) of the database. A database continuously evolves
as further data are linked to it. During this process it should be possible
to resolve ambiguous values by correcting one or more of the source components.
These procedures should include the correction of mistakes, such as event
date errors, discovered when linking additional data.
B2 The processes of standardization and integration should
be iterative. These repeating procedures are only made possible by preserving
a clear distinction between the literal transcription of the data in the
SOURCE and the standardized version of the data in the CENTRAL DATABASE.
Changes in the source components or in tables with standard values can automatically
or semi-automatically lead to changes in the contents of the central database.
This distinction also makes it possible to systematize the release of different
versions of the data. Database creators may choose to release different versions
of the data not only because standards or procedures in resolving data inconsistencies
evolve over time but also because the way data are standardized or integrated
may be dependent upon the specific goal of a particular research project
or publication.
B3 The rules by which data from different sources are integrated
should be documented. This recommendation applies to datasets which link
information about individuals and family members to construct unique biographies
and reconstitute families; it also applies to all other linked information
such as contextual economic or geographic information.
B4 To standardize the spelling of values drawn from different
sources, the following conventions are recommended. Spelling standards in
the CENTRAL DATABASE and DATA RELEASES will be set primarily by data drawn
from higher quality primary sources, such as vital acts, and secondly from
lower-quality sources such as population registers or tax records. In the
case of spelling inconsistencies between sources of the same quality, the
standard could be set by the most recent source; for example, spelling used
in death certificate data could prevail over spelling used in birth certificate
data.
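The convention in B4 might be implemented along these lines. This is a sketch only: the quality ranking, the source-type labels and the variant spellings are hypothetical, and a real project would define its own hierarchy of sources.

```python
# Rank sources by quality: vital acts above registers and tax records
QUALITY = {"vital_act": 2, "population_register": 1, "tax_record": 1}

def standard_spelling(attestations):
    """Pick the standard spelling from (spelling, source_type, year) tuples.

    Higher-quality sources win; within the same quality level,
    the most recent attestation prevails.
    """
    best = max(attestations, key=lambda a: (QUALITY.get(a[1], 0), a[2]))
    return best[0]

variants = [("Jansen", "population_register", 1880),
            ("Janssen", "vital_act", 1852),
            ("Janssens", "vital_act", 1874)]
standard_spelling(variants)   # the 1874 vital act prevails over the others
```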
B5 The decisions by which database creators make case-by-case,
manual corrections of mistakes in the original source (errors or discrepancies
of the nature discussed in points 1 and 3a) should be clearly documented
in the SOURCES component itself by preserving the clearly marked original
text or figure. Preserving the original text enables researchers to make
a different interpretation. Data transcription forms and data-entry
software should include comment fields so that transcribers and data-entry
operators have the opportunity to clarify their manual corrections.
All manual corrections are undertaken in the SOURCES files; in the CENTRAL
DATABASE only automatic corrections are made by applying rules designed to
correct source-based mistakes.
B6 Data which are inferred by means of logic or estimation
should be stored in a different variable or identified by a flag variable.
The manner in which such inferences are made should be recorded. This
documentation is especially important in the case of dating the data for
particular research methods such as event history analysis. While in many
cases, the primary source does not provide a specific date, it is often possible
to make an educated guess. Like other kinds of inferred data, estimations
of dates should be standardized. If different rules of inference are applied
to the same variable, an additional variable is needed which describes the
rules by which the dates were estimated or inferred.
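A minimal Python sketch of B6, with hypothetical field names and rules: the estimated date is kept apart from exact dates, and a flag variable records which rule of inference was applied.

```python
def estimate_event_date(record):
    """Return (year, flag): flag 0 = exact date in the source,
    flag 1 = estimated as the midpoint of a known interval."""
    if record.get("date"):
        return record["date"], 0
    if record.get("earliest") and record.get("latest"):
        # rule 1 (hypothetical): take the midpoint of the interval
        mid = (record["earliest"] + record["latest"]) // 2
        return mid, 1
    return None, None

exact = {"date": 1851}
interval = {"earliest": 1850, "latest": 1860}
estimate_event_date(exact)     # (1851, 0)
estimate_event_date(interval)  # (1855, 1)
```

Storing the flag next to the value lets a secondary researcher doing event history analysis decide for themselves whether to accept, re-estimate or exclude the inferred dates.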
B7 All electronic material relating to the database should
be stored, preserved and made obtainable in the simplest format possible
whether or not it is also released in proprietary software formats such as
Microsoft Access, SPSS and SAS. The choice of character set should be documented.
B8 To prevent data losses back-ups of working files must
be made at regular intervals. In addition, a copy of the whole dataset should
be preserved in a disaster-proof environment, namely a data archive such
as the Inter-University Consortium for Political and Social Research (ICPSR)
or the UK Data Archive.
C. Enrichment and release of data.
When disseminating the data, the following procedures should be honoured:
C1 Rules laid down by national and supranational data protection
acts must be respected. The rules which concern the database
in question should be made public, for example, by publishing the constitution
under which the data were gathered and released.
C2 Releases of all or part of the data-set should be accompanied
by, at minimum, the release name and release date. If the release name
is not unique, a serial number must be added as well. These release features
must be included in all tables and files which comprise the release. The
correct manner for citing the dataset in journals should also be made explicit.
New versions of existing releases should highlight the changes which have
been made since the previous one.
C3 Provision should be made for broad public access to
the data, either at no cost or for handling charges only. In order to comply
with confidentiality laws it may be necessary to provide the data in an anonymized
format, in which case the documentation should state that information has
been removed from this release of the data and describe which information
has been removed. Anonymization of data is particularly important in the
case of databases which include information about the causes of death or
which cover the complete life course of individuals and families who span
both historical and contemporary periods. In such instances, it is recommended
to anonymize all cases less than one hundred years old, even if there are
no legal objections against full publication.
C4 Data may be distributed freely or through specified
contracts with individual users. When the database is not freely distributed,
conditions on the use of data will be regulated by contract. If the released
dataset is freely distributed by the institution, users are not allowed to
a) disseminate altered versions of the data or make additions without consulting
the administrators of the database, or b) charge fees for use or distribution,
and users should c) cite the database according to the conditions of use
and d) send a copy of all publications, research reports or educational material
based on the dataset to the owners of the license.
C5 Variables should be standardized as much as possible
without losing content. Standardization should be effected in such a way
that original values are retained and new ones are kept in separate variables
or tables. Wherever possible releases will include the internationally standardized
versions of at least certain complex variables, such as occupation and birthplace.
C6 The reasons for missing data should be explained. Numbers
such as “0” and “99”, which are meaningful in certain contexts such as age,
should not be used to indicate missing information. Instead, negative
codes should be used. The following four codes are recommended as standard
values for lacking information:
-1 Not available in the source
-2 Not readable in the source
-3 Not available for reasons of privacy but existing in the database
-4 Not taken into the database for reasons of privacy
Other negative codes can be used for reasons specified in the documentation.
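A sketch of how these codes might be applied in practice. The labels follow the protocol; the function name and the mapping structure are hypothetical.

```python
# Recommended standard codes for missing information (cf. C6)
MISSING = {
    -1: "Not available in the source",
    -2: "Not readable in the source",
    -3: "Not available for reasons of privacy but existing in the database",
    -4: "Not taken into the database for reasons of privacy",
}

def describe(value):
    """Distinguish substantive values (e.g. an age of 0) from missing codes."""
    if value in MISSING:
        return MISSING[value]
    return f"valid value: {value}"

describe(0)    # a real age of zero, not a missing value
describe(-2)   # 'Not readable in the source'
```

Because the codes are negative, they can never collide with meaningful values such as age 0 or house number 99, which is the point of the recommendation.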
C7 All geographic information should be suitably geo-referenced,
to make it possible to relate the data to Geographical Information Systems.
C8 To assist researchers unfamiliar with research designs
which feature statistical tests, releases should refer to model studies which
employ the dataset or data similar to it.
C9 To encourage researchers unfamiliar with large databases
‘easy user versions’ of existing releases should also be produced.
C10 The Data Documentation Initiative (DDI) developed by
the ICPSR and participating institutions such as Networked Social Science
Tools and Resources (NESSTAR) should be adopted to guide the development
of metadata, or online documentation. The use of DDI will ensure that
metadata has been produced “in a uniform, highly structured format that is
easily and precisely searchable on the Web, that lends itself well to simultaneous
use of multiple datasets, and that will significantly improve the content
and usability of metadata.”
C11 Besides the documentation already mentioned in A3 and
B3, the website of the database should include the following elements:
a overview of all developed projects
b decisions with respect to this protocol
c international comparability of the database
d explanation of the database design
e if the database is a sample, explanation
of the sample design and provision of standard errors for major variables
f list of available variables and values
g variable descriptions,
including the universe for each variable and frequency distributions for
coded variables
h information on variable compatibility
across time and space
i information about database construction
and software
j conditions on the use of the data
k revision history of the database
l information about proper citation
of the database
m bibliography of model studies
n contact information
o acknowledgement of all participants
in the creation of the database and of funding sources
C12 To stimulate interest among potential users, information
about the progress of the database should be published on the website at
regular intervals. Such reports can consist of updates on the number of cases
entered into the database or the extent to which data have been coded and
prepared for release. Database documentation must be published on the website
beginning with the first release of the database.