Training Procedures for Data-Entry Personnel

The following procedures were developed by the GCP training advisor, Robert McCaa, and modified in light of the GCP's experience during the project on the 1821 and 1822 censuses. The ultimate success of this project depends upon accurately and efficiently computerizing Mexican census data written in nineteenth-century Spanish script. We have found it absolutely essential that data entry personnel be at least bilingual and, if possible, native speakers of Spanish.

In hiring our data entry personnel, we advertised for a bilingual person with data entry experience. We interviewed seventeen applicants, all either bilingual or native speakers who were fluent in English. They were asked to read a typical paragraph in one of our census manuscripts, to test their ability to sight-read nineteenth-century handwriting. Then we asked them to complete a data entry exercise from one of the districts that is easier to decipher, testing their data entry speed and accuracy. From the group, four finalists were invited back for a second interview. Each finalist read a further, more difficult text and entered a longer exercise from one of the districts. From these we selected an outstanding candidate: a native speaker (from Puerto Rico) familiar with handwritten Spanish and a fast, accurate data entry person. We intend to use the same procedure for selecting the two data entry personnel for this project.

In our experience, the most important characteristic is the ability to decipher handwritten Spanish, which matters even more than speed and accuracy. In the process of data entry, the lengthiest and most difficult entries are the literal data: given names and surnames, occupations, birthplaces, and titles. A native speaker appears to have a significant edge over persons who are bilingual but who grew up in a bilingual culture (such as in a Cuban household in Miami).

Once selected, the individuals will undergo a three-day training session provided by our training expert, Prof. Robert McCaa. Although the personnel were chosen for their ability to read handwritten Spanish, they still need training in recognizing certain conventions of paleography: which variables are most likely to be inaccurate, what assumptions the operators may and may not make, what conventions are used when words or parts of words are indecipherable, and a number of conventions that the GCP developed in the process of creating the database from the 1821 and 1822 censuses.

To reduce errors to a minimum, we will favor a longer schedule over a larger staff, using two data entry personnel for a longer period of time rather than twice as many assistants for half the time. Moreover, the Project Director and Project Assistant will be assigned the task of verifying the work of the key-operators, generating daily error reports, and developing additional instructions dealing with errors and the interpretation of handwriting.

In addition to the data entry operators, the data coders, who will also verify the data after it has been entered, will receive workshop training on their responsibilities. The verifier will be trained to use the verification software: how to discover errors, correct them, and generate reports to improve the accuracy of the key-operators' work. Project consultant Robert McCaa has used similar procedures in Mexico and found them to be surprisingly effective with as little as three days of training, followed by daily scrutiny of a sample of each operator's work.
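As an illustration only, a daily error report of the kind described above might be computed by comparing each keyed record against its verified counterpart, field by field. The sketch below is not the GCP's verification software; the function name `error_report` and the field names are hypothetical, and it assumes the keyed and verified records arrive in the same order.

```python
from collections import Counter


def error_report(entered, verified):
    """Return the share of mismatched values per field.

    `entered` and `verified` are parallel lists of dicts (hypothetical
    layout): one dict per census record, keyed by field name.
    """
    mismatches = Counter()
    totals = Counter()
    for keyed, checked in zip(entered, verified):
        for field, correct_value in checked.items():
            totals[field] += 1
            # A missing or different keyed value counts as an error.
            if keyed.get(field) != correct_value:
                mismatches[field] += 1
    return {field: mismatches[field] / totals[field] for field in totals}


# Made-up sample values, for illustration only.
entered = [
    {"surname": "Garcia", "age": "34"},
    {"surname": "Lopes", "age": "12"},
]
verified = [
    {"surname": "Garcia", "age": "34"},
    {"surname": "Lopez", "age": "12"},
]
report = error_report(entered, verified)
```

A per-field breakdown like this would show, for example, whether surnames are misread more often than ages, which is the kind of finding that could feed the additional instructions given to operators.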

The deciphering of handwriting that is almost two centuries old is neither automatic nor impossible. Structured instruction, using photocopies of carefully selected documents to be digitized, is the key to successful training. The training procedure begins with numbers as they appear in the documents because this accustoms trainees to the idea that transcription is like solving a puzzle or breaking a code. Arabic numerals are easily taken for granted, but even in the short span of a century and a half some variation has occurred in how individual digits are formed. With a couple of hours of instruction, an educated assistant can be taught to recognize and transcribe handwritten digits with a high degree of accuracy.

Alphabetic text introduces additional problems: many more letters, literal use of abbreviations, and a wider range of plausible contexts. Certain variables occurring in the manuscript censuses will be transcribed literally by the key-operators, namely given names and surnames, occupations, and titles. In the training, the manuscript will be scanned to provide a sample of illustrative texts. The sample materials will provide the basis for teaching the literal digitizing of the manuscript censuses. To increase the assurance with which data are captured, both coders and data entry personnel will be instructed in the use of flags to signal uncertainty, certainty (equivalent to "sic"), a torn or smudged original, etc. After the training is finished and as data collection proceeds, daily verification reports will enhance accuracy and perhaps speed. It is anticipated that at least one coder and perhaps several will have had experience with our data prior to the beginning of our project (having worked on the 1821 and 1822 project).
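The flagging convention can be pictured with a small sketch. The source does not specify the flag symbols the GCP used, so the characters below ("?", "!", "#") and the names `Flag`, `Transcription`, and `needs_review` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Flag(Enum):
    """Hypothetical flag symbols; the project's actual codes are not given."""
    NONE = ""
    UNCERTAIN = "?"   # operator is unsure of the reading
    SIC = "!"         # reading is certain, odd as it looks (like "sic")
    DAMAGED = "#"     # original is torn or smudged


@dataclass
class Transcription:
    text: str
    flag: Flag = Flag.NONE

    def encode(self) -> str:
        # Append the flag symbol to the literal text as keyed.
        return f"{self.text}{self.flag.value}"


def needs_review(entry: Transcription) -> bool:
    # Only uncertain or damaged readings go back to the verifier.
    return entry.flag in (Flag.UNCERTAIN, Flag.DAMAGED)


# Made-up sample values, for illustration only.
entries = [
    Transcription("Garcia"),
    Transcription("Xim", Flag.UNCERTAIN),
    Transcription("Ysidro", Flag.SIC),
]
to_check = [e.text for e in entries if needs_review(e)]
```

Keeping the flag separate from the text, rather than typing it into the name itself, would let a verifier filter doubtful readings without altering the literal transcription.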

Digitizing data. With recent advances in microcomputer software, the actual operation of the data collection should be the easiest portion of the workshop. Collecting historical census data in literal form does not require sophisticated collection software. Assistants with some prior experience in the use of microcomputers should readily adjust to the needs of this project. To ensure accuracy, speed, and data integrity, instructional modules will be developed on keying data, correcting mistakes, saving records, safeguarding files, advanced "mouse" techniques, and the like. We found from the experience of our 1821-22 project that both coders and data entry personnel will come to the project with most of these skills already in hand and will need little training beyond how those skills fit into our project procedures.

These training procedures are similar to those developed some years ago at the University of Minnesota for the computerization of a 1% sample of the 1880 manuscript census of the United States (research funded by various federal agencies) and recently adapted for a five-year, federally funded project to computerize a sample of other federal historical censuses. The finished product, for the GCP as for the projects carried out by the Historical Censuses Project at the University of Minnesota, is a literal transcription of the census data, computerized in a fixed-field format so that the user can read or analyze the data with the greatest confidence that the data-collection process kept errors to a minimum.
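To make "fixed-field format" concrete: each variable occupies the same column positions in every record, so a line can be read back by position alone. The sketch below assumes a hypothetical four-field layout; the GCP's actual field names and widths are not stated in this document.

```python
# Hypothetical layout: (field name, width in characters).
LAYOUT = [("surname", 20), ("given_name", 20), ("age", 3), ("occupation", 15)]


def to_fixed(record):
    """Write one record as a single fixed-width line."""
    parts = []
    for name, width in LAYOUT:
        value = str(record.get(name, ""))
        if len(value) > width:
            # Refuse to truncate silently; literal data must survive intact.
            raise ValueError(f"{name!r} exceeds {width} characters: {value!r}")
        parts.append(value.ljust(width))
    return "".join(parts)


def from_fixed(line):
    """Read a fixed-width line back into a dict, by column position."""
    record = {}
    pos = 0
    for name, width in LAYOUT:
        record[name] = line[pos:pos + width].rstrip()
        pos += width
    return record


# Made-up sample values, for illustration only.
rec = {"surname": "Lopez", "given_name": "Maria", "age": "34",
       "occupation": "costurera"}
line = to_fixed(rec)
restored = from_fixed(line)
```

Raising an error on over-long values, instead of cutting them to fit, is one possible design choice here; it reflects the document's emphasis on literal transcription.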




