PRDH 1881 :: Data Entry

T.O.C.

Data Entry

1881 Cleaning Manual, PHASE 1

Getting Started

The LDS volunteers entered geographic information into the data into just two columns, even though there are a number of geographic categories for the 1881 Census. In addition, sometimes they didn't enter the correct information and sometimes they didn't enter any at all.

Fortunately, we have the Geographic Index to the 1881 Census of Canada produced by the National Archives of Canada to help us restore the proper geographic information to the data. This Geographic Index (for which no electronic copy was available from the NAC) has been entered into an Excel file by Karli Bergquist, a Canadian Studies student, and I have given copies to each research student.

The 1881 Geographic Index contains all the necessary geographic information which we want on the file.

In 1881, sub-districts which were quite big got divided into two divisions, and the division number accompanies the sub-district letter e.g. a-1, a-2, a-3; in most instances, sub-district is not divided so you just get a, b, c. If there is a division, it signals the beginning of a new cluster of pages or volume. In the index, division is displayed as a part of sub-district.

The current 1881 data usually contains all this geographic information but unfortunately it is jumbled up in two fields, Header Source and District

Ideally, we want to create new categories (or "variables" or "fields") which mimick the geographic organization we find in the National Archives 1881 index: district number, district name, sub-district letter, division number, sub-district name. These fields in conjunction with microfilm reel number, page number and line number will create the correct overall sort order for the entire file. So in the process of straightening out the geographic information, we are also ensuring that all the data can put be in the same sort order as what is on the microfilm reel.

We are also adding a new field called "volume", which is basically a number for each sub-district (and division, where it exists). This number simply starts at 1 at the beginning of a new microfilm reel number and goes up with each new sub-district (and division, where it exists) until a new microfilm reel number starts (see the above table for an illustration of volume). We are adding volume because we can then put microfilm reel number, volume number, page number and line number together to get a unique identification number for every single individual in the database. (It is easier to work with all numeric fields when creating a unique identification number, or unique-id). (The microfilm reel number isn't numeric at the moment as it has that "C" in front, but we will probably remove that later).

I will be referring to volume a lot in this manual. Always remember that:

volume = sub-district or sub-district + division (where division exists)

I am creating volume for each microfilm reel ahead of time, and will put them in an Excel table of the 1881 Finding Aid that I will give each of you. You can always refer to this Excel finding aid table to see the volume numbers for the reel you are working on.

How the data is sorted

The LDS data is actually sorted by Record_ID, which is the same as sorting on:
Batch, Packet and Entry Number (in that order, left to right).

Record_ID is a number which was applied by the LDS to the data after they received it all from the LDS volunteers. Each LDS volunteer worked on perhaps half a sub-district. As the volunteers sent their data back to headquarters, it got put into a file. If the volunteers who were working on the first and second halves of a particular sub-district didn't finish at the same time, the first and second halves of this sub-district got added "into the file" at different times and therefore were not "put together" (at least, this is what it seems happened). Then when all the data was received by LDS headquarters, they applied this sequential Record_ID number to all the data, running from 1 to 4.3 million, 1 for every person in the database. As a result, Record_ID is useful for sorting people within sub-districts into family and page groupings, but it doesn't put all the sub-districts in order. However, it does put people in the database into useful "chunks".

Once you have identified the volume number, it is very useful to sort on volume, then record_id.

Try to learn how to sort the data in various ways to help you identify missing data as you go (more on this later).

The infamous dupe key

David has created a "DUPE" function key which we can use to input information that is the same as the line above (e.g. district or sub-district, which repeats for multiple pages in a row). To use the Dupe key, position the cursor in a cell below the cell with the information you want and press F3. This will work in any of the tables you are working in. You will find this will save you a lot of time.

GETTING STARTING

Step 1 - Initial Geographic Cleaning
Step 2 - Fix Geographies
Step 3 - Check Volume, Page and Line Numbers for IDs
Step 4 - Check for Missing Pages
Step 5 - Check for Missing Lines
Step 6 - Check Family Numbers

Follow the directions on "File Management" to get your MS ACCESS and MS EXCEL files from the I:1881 Drive, copy it onto the C: drive, rename them with today's date, and open them up.

The MS EXCEL file with your name on it contains the electronic 1881 Finding Aid pages which correspond to the data you will be working on. I would like you to have this file open as you are working & add notes to it as you go. We have proofread the electronic finding aid in MS Excel against the original hardcopy from the National Archives. However, it is always possible that a few more errors are hiding in there. If you find something "strange", double check the paper copy against the Excel finding aid.

If the program in MS ACCESS ever crashes (which we hope it doesn’t, but computers are computers) exit MS ACCESS and start all over again.

Open up your most recent file in MS ACCESS, select the reel you want to work on from the drop down list in the main menu and enter your initials.
*This form will and should stay open during the whole time you are working on your data.

Press Next.

Step 1 - Initial Geographic Cleaning

Step 1 opens up. Press "Edit Geography Dictionary".

In Step 1, we are doing some initial geographic cleaning to the "Geographic Dictionary". The Geographic Dictionary is simply a table which summarizes the unique occurrences of those two original LDS "Header Source" and "District" columns (it groups together all the people who share the same content or "values" of those two columns together). This table also contains a "Frequency" or "Freq" column which indicates how many people are grouped by those two columns (sometimes it's a lot of people, sometimes it's just a few).

Next to those columns are blank columns for the geographic fields that we want to have (Province, District Number, District Name, Sub-district Letter, Division Number, Sub-District Name, Volume).

Your job in Step 1 is to fill those blank columns in.

We're starting off with the assumption that the content of original LDS Header Source and District columns are correct (though some of them aren't and we'll have to correct that in Step 2).

Fill in values for district_name, district_number, subdistrict_name, subdistrict_number, division_number and volume based on your information from the original LDS Header Source and District columns, guided by the Finding Aid for that reel.

Don’t bother entering french accents into these columns for now. I know that sounds awful, but the LDS search engine software can't handle accents (they try to create a search engine that will work for all genealogists, no matter how antiquated their computers are) (plus they disseminate genealogy data all over the world and it gets too confusing to have accents). So for now, we are creating these fields without accents. We will have to add in the accents later to the French language entries.

Remember that in Step 1, you are really making suggestions in each of these columns for what you think the geographic information should be, on the basis of what the original Header-Source and District columns from the LDS say. You can always change these suggestions or guesses later. By cleaning up this dictionary table, we can go a long way toward cleaning the geographic fields in the main data table for each microfilm-reel-worth of data.

You will find some instances in which there is not enough information in the original geographic columns to guess at what to fill in the new columns. Usually this happens when there is enough information to guess district and sub-district, but you see on the Excel finding aid that that sub-district is further divided into two divisions, but you can't tell what the division is; the original LDS geographic columns simply don't say. If you know there should be a division number but you don't know what it is, then you can't assign the volume number.

Don't worry, you will figure out in Step 2 anything you weren't able to guess in Step 1.

One of our researchers, Betsey Baldwin, has devised a good way of dealing with those instances in which you just cannot guess what the volume number and division (and perhaps sub-district, or even district) should be:

She fills in any leftover blank spaces with a value which is "obviously wrong". So if volume is blank on one line, she puts "222". The next line in which volume is blank, she puts "333". The next line, "444" and so on. Do this as well with any other places where you couldn't guess the information. For example, for each single blank line in sub-district, you could put "aaa" "bbb" "ccc" and so on. The important thing is to make it an obviously wrong value, something you will recognize in Step 2. The obviously wrong value should also be different in each "cell" (or square box on the screen). If you have several, you might want to write them down, although you don't have do if you know you can search carefully for them in the next step.

The point of assigning "obviously-wrong values" in Step 1 is that the records (or people) represented in each horizontal line in the Step 1 table "go together": the geographic information available on them is not very good, but it is "bad for all of them" as a group, so assigning these "obviously-wrong values" keeps those people together, and allows you to make better guesses in Step 2.

After you have filled in as much as you can in this table and inserted obviously-wrong values for the rest, exit out of the table. MS ACCESS saves automatically, so don't worry about saving it yourself. If it prompts you to save the "design layout", always press "No".

** Whatever you do in this step, don't convert the data type of the fields other than "frequency" to numeric. It makes the ID program (one that gets run a little later) bomb. In later steps in this entire program, you will be asked to convert the data type of the fields, and at that point, you will say "Yes" to "save the design layout".

Now you want to add all the new columns you've filled in to the data itself. You do this by pressing "Make Edit Reel Table". Pressing this button launches a "module" which integrates the information from the Geography Dictionary into the main file.

Press "Next" to get to Step 2.

Step 2 - Fix Geographies

In this step, you will work more closely with the data to resolve any outstanding geographic classification problems you weren't able to resolve in Step 1 (e.g. you will fill in the correct geographic information for those "cases" or "records" (or people) for whom you weren't able to identify correct geographic information in Step 1).

You can see on the Step 2 Form four buttons (plus a "Back" and a "Next" button at the bottom of the form).

Two of these buttons open tables ("Open Geography Page Summary" and "Edit New Reel Table") and two of these buttons launch programs (or "modules") ("Generate Geog Page Summary" and "Merge Geog Updates").

In Step 2, your job is to assign the rest of the geographic information that you weren't able to assign in Step 1. You can do this either by using the Geography Page Summary or making changes directly in the actual data table, or a combination of both.

Press "Generate Geog Page Summary" to generate the Geographic Page Summary. Then press "Open Geography Page Summary" to see this table. This summary table summarizes the new geographic information you entered in Step 1 by page. There will be one line in this table for every page in the full data table, and on each line will be "district number", "district name", "sub-district letter", "division number", "sub-district name" and "volume". Since this table goes by the page, it is of course a good deal longer than the table you were working with in Step 1. However, it is shorter than the full data table which shows one line for every person.

For those pages for which full geographic information could not be identified in Step 1, their "cells" (or those little boxes on the screen) for the geographic columns will contain those "obviously wrong values". These obviously-wrong values are like mysteries that you are going to try to solve here, with the assistance of the page number information. You can fill in corrections in the "U" columns to the right of the screen, not in the original columns.

Move columns around so you have volume, page and line number side-by-side, and then sort on volume, page and line number. As with Step 1, don't convert the data type of the fields in the Geog Page Summary to numeric.

You will notice that some of the fields like Page Number are sorted in a weird way, like "1, 10, 100, 101, 102,..., 2, 20, 200, 201, 202,...". This is because these fields are currently defined as text (or alphabetic) fields rather than as numeric fields. This is because the Geography Page Merge program will not run properly if these fields are defined as "numeric" fields at this stage. For the Step 2 stage, you will have to get used to seeing the page numbers sorted as "1, 10, 100…". You can still see the general order of the pages and whether all the pages are there.

The first thing I want you to do in this table is deal with the volumes that you were able to identify in Step 1, checking off that all the volumes and every anticipated page number in each volume are present. Work against the Excel table for that reel. Go to volume 1, look up the page range for volume 1 in the Excel table, and look in the "Geog Page Summary" to see if all those pages are there (you'll have to scroll around a bit since the numbers are sorted alphabetically). If all the expected pages are there, put a "1" in the "All There!" column.

If partial pages are missing from a volume, indicate which pages are missing in the Excel column "Pages Missing". If an entire volume is missing, put a "1" in the column "Whole Volume Missing". Make any other notes at the bottom of the Excel table right in the file. This way we are certain that every page for each volume has been looked for, and we know which pages and/or entire volumes are missing. I can then forward these reports on to the LDS in case they need to enter more pages of data.

Once you have checked off the volumes on the Excel table, now work on any lines in the Geog Page Summary for which the required geographic information is still missing.

*** If you make any changes to the geographic columns, make them to the "U" version of the geographic columns (e.g. "U SUB DISTRICT LETTER") instead of to the original column. If you don't use the "U" column, when you go to integrate these changes into the main file, your new changes will be ignored.

Now that we can see what pages are associated with these "mystery" lines, you can use the expected page ranges for each volume given in the Excel table to try to guess what volume numbers (and sub-district letter, division number, sub-district name etc.) should be filled in for these mystery lines. For example, if you have successfully identified pages 1-30 of volume 1, and you know that the page range for volume 1 is 1-40, and you have a "mystery" chunk of lines going from page 31 to page 40, and you have identified all the other volumes, and none of the others are missing pages 31 to 40, then you can safely guess that those pages 31 to 40 belong to volume 1, and you can fill in the required geographic information for those pages 31 to 40. Make a note to yourself on the Excel table each time you make a guess like this.

The "obviously-wrong values" will "keep those chunks together" in a useful way. If you can guess a correct value for a "obviously-wrong value", you make the change in the U-column for that variable, and use the F3 DUPE Key to duplicate it to all the relevant pages. You may have to also make a change in the U-DIVISION column and perhaps U-Sub-District column.

*** Some of these mystery chunks (and some of the chunks you did identify in Step 1) may prove to be duplicate lines. You can tell if there are duplicates in the Geography Page Summary table if the "Freq" column (which indicates the number of people represented by that line) is 50 or more instead of 25 (since the Geography Page Summary has one line per page, and there are 25 lines per page on the census, there should be just "25" in the "Freq" column for each line in the Geography Page Summary). If you see lines with more than 25 or 26 in the Freq column, make a note in the Excel table; you will have to identify those duplicates in the full data table so we can remove them (see below).

You may not be able to solve all these mysteries at this stage, and will have to solve them in the full data table. Once you are done with the Geog Page Summary, close the table (again say no if it prompts you to save the design layout).

Press "Merge Geog Updates" to merge or integrate the changes you made in Geog Page Summary into the main data table.

Now click on "Edit New Reel Table" to see your main data table, with all the changes you have made so far in Steps 1 and 2 added in.

[ You do have the choice of not using the Geography Page Summary at all, and resolving all outstanding geographic "mysteries" in the full data table. There are only two differences in the two approaches. One is that the full data table is bigger (remember, one line for every person). Since the full data table is bigger, you have to do more scrolling up and down to determine the beginning and ending page ranges of the "mystery page chunks". Second, you make the changes directly in the original column, not in a U-column. Since you make the changes in the original column, you can use the automatic "search-and-replace" function (under Edit) to replace the obviously-wrong values with the correct value. ]

To see your new columns of geographic information, scroll all the way to the right. Remember that you can move these columns around to make it easier to compare them to each other. You want them to be beside page number, line number and family number and first & last names in particular.

You might want to move "Volume" and "Record_ID" beside each other and sort on them together to put the file in a reasonable order (this puts the newly-identified volumes together, plus it puts people within families and on pages, and then puts the pages themselves in the right order). (remember that the problem with sorting on page number or entry number etc. is that they are currently defined as alphabetic fields, so sorting on them puts them in that "1, 10, 100" order. But Record_ID has been defined as numeric from the get-go, so sorting on it puts those lines and pages back in order within volume).

Remember that MS ACCESS sorts on the field from left to right (so put Volume on the left and Record_ID on the right).

In this New Reel data table, you can resolve any remaining geographic mysteries you weren't able to resolve so far. This is because you can use Family Number and the Last Names of individuals to verify your guesses. ** Family Number is sequential through each volume. ** This will help you to match up missing chunks of data. You can also use "Record_ID" to match up chunks of data. You can always search for "the next or previous expected record_ID", to see if it is out there.

Remember that you are also operating by the process of elimination. You've first identified the volumes that are there, and when you are trying to identify the "mystery page chunks", first compare them to volumes in your Excel Finding Aid which have not yet been identified as "All There!"

Also, please revisit those "guesses" you made just above in the Geog Page Summary Table, just to make sure that you guessed right. It will be much easier to see that you guessed right by looking at the full data.

In your guess up above that those mystery pages 31 to 40 belonged in Volume 1, look to see if the family number on page 30 in Volume 1 "continues on" onto page 31 -- it should be sequential. You can use the last name of individuals in a family the same way, if a new page starts in the middle of a family.

*** There is always the possibility that there are duplicate chunks in the file. Sometimes different LDS volunteers accidentally got assigned the same sub-district section to enter, so we end up with two versions of some parts of the 1881 census in the file. There is also the possibility that whole volumes were missed by the LDS altogether.
*** If you identify some duplicate records (duplicate people), record on the Excel table the beginning and ending Record_ID numbers for the duplicated lines. Remember that Record_ID is a numbering system that the LDS created, that goes from 1 to 4.3 million all the way through the file. This way, David can later remove all the duplicated people and put them in an "Extras" File.

If you still have some chunks of data for which you simply could not nail down their precise geographic information, you're going to have to look them up on the microfilm. Usually this doesn't happen! Focus first on the volumes which you have not yet identified (perhaps there are just a few). Observe the number of the first family in that chunk in the data, and observe the first and last name of the first person, load up the microfilm, scroll to the first not-yet-identified-volume (sub-district and division), and see if that person is there. If so, check the last person in the "mystery page chunk" in the data against the microfilm just to be sure, and enter the correct geographic information right in the data table.

As you work on correcting the geographic information in the data table, you may see some errors in the page number, line number and family number fields; they will often just pop out at you. Go ahead and correct them as you see them, but don't worry too much about them at this stage. Page and line number errors are instead corrected at Step 3. If you see information in a cell which belong somewhere else (like a first and last name in the page field), fix it then and there (check to see if the first and last names are missing and enter them if they are).

When you have finished identifying all the geographic information in Step 2, close the table and press Next to proceed to Step 3.

Step 3 - Check Volume, Page and Line Numbers for IDs

This is the step in which we check for invalid page and line numbers. We know that the page numbers should go from 1 to wherever they stop for the given page range for each volume. We know that all line numbers should go from 1 to 25, as there are only 25 lines per page.

Press "Generate ID Look-Up" to launch the program which searches the data table for erroneous or missing volume, page and line numbers. Then press "Open ID Lookup" to view those erroneous or missing volume, page and line numbers. You can print this list off and close the table, or you can keep this table open as a long horizontal rectangle at the bottom of the screen and have the full data table open as another bigger rectangle above it. As in Step 2, press "Edit New Reel Table" to view the data itself. Notice that the ID Lookup table shows the page and line numbers for each volume. This way, you can search for these errors by each volume.

** Note that ACCESS has a "filter" function at the top of the screen which allows it to display only certain records at a time. This can help make your searches go faster. For example, if you are starting with incorrect page numbers in volume 1, you can position the cursor on the "1" in the "Volume" field, click on the icon with the "filter" and the little yellow lightning bolt -- then you "see" only those cases (or records, or people) with a value of "1" in the Volume field. To get everyone back again, press the icon with just the "filter" symbol (it is called "Remove Filter", and it is sort of highlighted while you have a filter on).

Try sorting your file first on Volume, then on Record_ID. This puts the data within volumes together in the order it was entered. This makes it a lot easier to understand what the source of errors might be.

Search for and correct the page and line number errors. They are usually really easy to fix. You should be correcting them in the context of all the surrounding information--e.g. is it a "23" in the middle of a "1, 2, 3, 23, 5, 6, 7" run of numbers? Is it obvious that it was just a 4 missing? Perhaps that "23" was actually the household number (and if so, is the adjacent household number wrong)? Often the data entry volunteers entered data one column over for 2 or 3 fields, and this is obvious once you look at the data.

Verify your corrections by looking over to make sure that your correction "fits" in the context of the family number and the last names of the people you are putting the page number "between".

Once you are done, you are now ready to change the field type of the volume number and the page number fields from text to numeric. This is not hard to do: at the top of the ACCESS screen, click on "View", then on "Design View". You will see a list of Field Names and Data Types that just explain how each field is classified. Click on the word "Text" next to volume number and use the drop-down list to change it from text to "Number". Do the same with Page Number. Now click on "View", then on "Design View" to go back to the data table. When you are prompted to save this time say yes (because we have deliberately made these changes in the field types). (We have to make this change from Text to Number at this point because otherwise, the missing pages table in Step 4 will be blank).

When you are done, click on "Next".

Step 4 - Checking for Missing Pages

Now you are ready to check for missing pages. Click on "Generate Missing Pages Table". Then click on "Open Missing Pages Table" to view which pages are missing. If the program crash, it's possible you forgot to change these fields to numeric at the end of Step 3.

You can print off the list, or keep this table open. As before, click on "Edit New Reel Table" to open the full data table and find the blanks in the actual data. If you make a bunch of changes and want to see how many you've eliminated, close the "New Reel Table", click once again on "Generate Missing Pages Table" and click on "Open Missing Pages Table". Make sure all tables are closed before you click on "Generate Missing Pages Table".

** Remember in the full data table to sort on Volume, then Record_ID, which helps you see errors.

When you are totally finished with this step, close the "Missing Page Table" but keep the full data table open. You are now required to change the format of the line number column to "number". Click on "View - Design View". Locate the line number field on the left side of the box at the top. On the right side, click on the drop-down list and change it from text to numeric.

When you are finished with this step, press on the "Next" button.

Step 5 - Check Missing Lines

This step works much like Step 4, only this time it is checking for missing lines.

Like Step 4, you can print off the missing line report or you can keep it open as you work on the full data table. I find it most handy to keep it open as you work.

Remember to sort on Volume, then Record_ID. Often it is not the line number that is wrong, it is the page number, and this sort order puts things "in the order you need in order to observe that it is actually the page number that was wrong.

You might ask, "How can there be any wrong page numbers? We've already corrected invalid and missing page numbers"? The problem is, in Steps 3 & 4, the program can pick up invalid page numbers like "/4" and missing ones, but if the data entry volunteer accidentally entered "10" instead of "1" in the page number field, this will not show up as an error until now. The computer thinks the "10" is fine. However, when the computer sorts the data for Step 5, it puts this person in with the page 10s and not with the 1s. Then when it looks for missing lines, it will discover that page "1" is missing a line "20". (Instead, page 10 will have two 20s). If you sort on Volume, then Record_ID, this will put that person back up with the page 1's, and you will see, "Aha! It's not that page 1 line 20 is missing, it's that this page number got an extra zero and I can take that off now!"

When you are totally finished with this step, close the "Missing Lines Table" but keep the full data table open. You are now required to change the format of the family number column to "number". Click on "View - Design View". Locate the family number field on the left side of the box at the top. On the right side, click on the drop-down list and change it from text to numeric.

When you are finished with this step, press on the "Next" button.

Step 6 - Check Family Numbers

In this step, we check that family numbers are in numeric sequence. Family Number is a column on the original manuscript census, (to the right of another column called "House Number", which recorded which building people lived in). A new family number is given to each family -- family can include grandparents, servants, boarders, visitors, farm labourers alongside the parents and children.

Sometimes the LDS data entry volunteers entered family numbers out of sequence, and sometimes they didn't give children the same family number as their parents. That's why we have this step.

This Step works the same way as Steps 3, 4 and 5. Press "Generate Check Family Number Table". It can take a minute or so. Then press "Open Check Family Number Table".

*** Look at the bottom of that Check Family Number Table to see how many family errors have been generated. If there are more than 100, record that number on your Excel file with a note to me, and do not proceed further with this file. If there are 100 or less errors, then proceed to correct them.

You will be consulting the family numbers alongside first and last name, age, sex and marital status to get a sense of which people belong together in families. It is most straightforward to put children with the parents; it can be more difficult to figure out where "extra people" go. Sometimes you will end up giving a unique family number to one person, but remember that in the 19th century, people rarely lived all by themselves.

The main point in correcting the family number errors is to give all family members the same number. Occasionally you will feel that you are running "out of numbers"; just make sure that adjacent family members do not have the same number.

Last updated: 2/10/2021