PRDH 1852 :: Reaction by prof. Kris Inwood

1852 Oversampling Strategy

The 1852 Census of Upper and Lower Canada:
Proposed Oversampling Strategy, and Discussion

Reaction by prof. Kris Inwood, Department of Economics, University of Guelph

Lisa,

I have just now seen the reference to your missing population and the proposal to oversample closely matching areas. I send my first reaction, literally after fifteen minutes' reflection, because you asked some weeks ago, and the clock is ticking.

My starting point is to think about the issue in a broader context of why you would want to do anything at all. One reason, as you say, is to produce a more representative sample which would allow users to reduce bias in summary tables and hypothesis testing. You do so by extra-weighting certain regions because they are known to be similar in some way to the regions for which districts are missing.

My reaction would be different if this were a simple process. In fact, this is likely to be quite complex, if I understand correctly (possibly I do not). Probably you will not be able to assess and develop a strategy for representativeness defined in terms of relationships between characteristics. Rather, you will work with means or other simple characteristics of univariate distributions. Even with this simplified approach it seems likely that the extra-weighting needed to ensure representativeness for one characteristic will differ from that needed for another (e.g. population density vs. age structure vs. birthplace profile vs. settlement age, etc.). Hence the need to average in some way the specialized weighting schemes in order to produce a composite index which guides the oversampling of some areas.

One disadvantage is that you will not know ahead of time if the resulting sample with enhanced density from select regions is more representative in any one dimension. It will be appropriate for your index of characteristics, because it was constructed with this purpose. However, it might be more biased than the original for some or even all of the individual characteristics (that were averaged to construct the index). It would be possible to test this, as Gordon suggests, although the results from a test in one local area may not be representative of other regions.
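The averaging problem can be sketched with a toy calculation. The following is a minimal illustration, assuming just two surviving regions and two characteristics; every name and number in it is invented for the example and none is drawn from the 1852 returns. It finds the share of the sample that region_a would need for each characteristic taken separately, averages those shares into a composite, and compares the resulting bias with that of the sample as it survives.

```python
# Hypothetical figures for illustration only; not 1852 census values.
surviving = {
    "region_a": {"density": 10.0, "pct_under_15": 45.0},
    "region_b": {"density": 40.0, "pct_under_15": 35.0},
}

# Assumed published means for the whole population, missing districts included.
targets = {"density": 22.0, "pct_under_15": 42.0}

# Assumed share of region_a in the sample as it survives.
original_share_a = 0.62


def weighted_mean(characteristic, share_a):
    """Mean of a characteristic when region_a carries share_a of the weight."""
    a = surviving["region_a"][characteristic]
    b = surviving["region_b"][characteristic]
    return share_a * a + (1.0 - share_a) * b


def matching_share(characteristic):
    """Share for region_a that reproduces the target for this characteristic."""
    a = surviving["region_a"][characteristic]
    b = surviving["region_b"][characteristic]
    return (targets[characteristic] - b) / (a - b)


def bias(characteristic, share_a):
    """Weighted mean minus the published target."""
    return weighted_mean(characteristic, share_a) - targets[characteristic]


# One specialized share per characteristic, then a simple average of them.
shares = {name: matching_share(name) for name in targets}
composite_share = sum(shares.values()) / len(shares)

for name in targets:
    print(
        name,
        "needs share", round(shares[name], 3),
        "| bias as surviving", round(bias(name, original_share_a), 3),
        "| bias with composite", round(bias(name, composite_share), 3),
    )
```

With these invented numbers the two characteristics call for different shares, and the averaged share brings the age profile closer to its target while moving density further from its target than the surviving sample was.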

A further complication is that future users of the data will be testing a wide variety of hypotheses, each with its own representativeness concern. You cannot anticipate all of them, and even if you could, the averaging challenge will make it impossible to satisfy all of them. Indeed, the constructed sample may have biases that it would not occur to you to check, but which might be highly inconvenient for specific research.

This line of argument seems to lead in the direction of Benoît's perspective. It might be just as useful for you to work at developing alternate weighting criteria that users, each to their own purpose, would find most useful. For example, you publish the data as they survive, with a number of alternate weighting schemes. You will be able to give a clear explanation of each of the weighting schemes, and users will understand easily. Given the known gaps of regions with known characteristics, there might, for instance, be one set of weights for people who want to adjust for bias in the population density of the sample, and other weights for those who are concerned to have representative age structure, birthplace, and so on.
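To make the idea of alternate weighting schemes concrete, here is a minimal sketch of how a user might apply them, assuming the data are published as they survive together with one set of regional weights per concern. The scheme names, regions, weights, and records are all hypothetical and serve only to show the mechanics.

```python
# Hypothetical regional weights, one set per representativeness concern.
weight_schemes = {
    "density": {"region_a": 1.2, "region_b": 0.8},
    "age_structure": {"region_a": 1.4, "region_b": 0.6},
}

# Hypothetical surviving records: (region, age of person).
records = [
    ("region_a", 12),
    ("region_a", 30),
    ("region_b", 45),
    ("region_b", 8),
]


def mean_age(scheme=None):
    """Mean age under the chosen scheme; no scheme means equal weights."""
    weighted_total = 0.0
    weight_sum = 0.0
    for region, age in records:
        weight = 1.0 if scheme is None else weight_schemes[scheme][region]
        weighted_total += weight * age
        weight_sum += weight
    return weighted_total / weight_sum


# Each user picks the scheme that suits the hypothesis being tested.
for scheme in (None, "density", "age_structure"):
    print(scheme or "unweighted", round(mean_age(scheme), 2))
```

The data themselves are never altered; each scheme is a lookup table that a user can apply or ignore, which is what makes the adjustment easy to explain.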

I predict that users will understand this more easily than the representativeness implications of a complex index that is used for extra-weighting. In both cases you are explaining how to use weights based on known regional characteristics from published data to adjust for missing micro data. One approach, however, may be more transparent and perhaps also more flexible than the other.

If you were able to recycle the extra data entry time into expanding sample density across the board, so much the better. I am a big fan, as you know, of Steve [Ruggles]'s emphasis on maximum possible density.

All this with limited reflection, in order to give you some kind of reaction as quickly as possible.

I had not realized so much was missing from 1851-52. I wonder if Bruce Curtis might have come across a few stray returns during his long rummaging through Bureau of Agriculture files. I have a dim memory that he found additional returns, possibly 1861 rather than 1851-52.

Best wishes with this!
Kris

Last updated: 2/10/2021
