Training
Procedures for Data-Entry Personnel
(from old website)
The following procedures
were developed by the GCP training advisor, Robert McCaa, and modified
in light of the GCP's experience during the project for 1821 and 1822.
The ultimate success of this project depends upon accurately and efficiently
computerizing Mexican census data written in nineteenth-century Spanish
script. We have found that it is absolutely essential that the data
entry personnel be at least bilingual and, if possible, native speakers.
In hiring our data entry personnel, we advertised for a bilingual person
with data entry experience. We interviewed seventeen applicants,
all either bilingual or native speakers who were fluent in English.
They were asked to read a typical paragraph in one of our census manuscripts,
to test their ability to sight-read nineteenth-century handwriting.
Then we asked them to complete a data entry exercise from one of the
easier districts to decipher, testing their data entry speed and accuracy.
From the group, four finalists were asked back for a second
interview. Each candidate read a further, more difficult text and
entered a longer exercise from one of the districts. From these we selected
an outstanding candidate, one familiar with handwritten Spanish, a
native speaker (from Puerto Rico), and a fast, accurate data entry
person. We intend to use the same procedure for selecting the two data
entry personnel for this project. In our experience, the most
important characteristic is the ability to decipher hand-written Spanish,
which matters even more than speed and accuracy. In the process of data entry,
the longest and most difficult data are the literal data: given names and
surnames, occupations, birthplaces, and titles. A native speaker appears
to have a significant edge over persons who are bilingual but who grew
up in a bilingual culture (such as in a Cuban household in Miami).
Once selected,
the individuals will then undergo a three-day training session provided
by our training expert, Prof. Robert McCaa. Although the personnel
were chosen for their ability to read hand-written Spanish, they still
need training in several areas: recognizing certain conventions of
paleography, knowing which variables are the most likely to be inaccurate,
knowing what assumptions the operators may and may not make, applying the
conventions used when words or parts of words are indecipherable, and
following a number of conventions which the GCP developed in the process
of creating the database from the 1821 and 1822 censuses.
To reduce errors
to a minimum, we will trade time for employees, using two data entry
personnel for a longer period of time rather than twice as many assistants
for half the time. Moreover, the Project Director and Project Assistant
will be assigned the task of verifying the work of the key-operators,
generating daily error reports, and developing additional instructions
dealing with errors and the interpretation of handwriting.
In addition to the
data entry operators, the data coders, who will also verify the data
after it has been entered, will receive workshop training on their responsibilities.
The verifier will be trained to use the verification software: how
to discover errors, correct them, and generate reports to improve the
accuracy of the key-operators' work. Similar procedures have been utilized
by project consultant Robert McCaa in Mexico, and he found them to be
surprisingly effective with as little as three days of training, followed
by daily scrutiny of a sample of each operator's work.
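The daily error report described above can be pictured with a short sketch. It is illustrative only: the source does not describe the verification software, so the function name `error_report`, the field names, and the idea of comparing an operator's keyed records against a verifier's corrected copy of the same sample are all assumptions.

```python
from collections import Counter


def error_report(keyed, verified, fields):
    """Return, per field, the share of sampled records that differ.

    `keyed` is the operator's work, `verified` the verifier's corrected
    copy of the same sample (both hypothetical lists of dicts).
    """
    if len(keyed) != len(verified):
        raise ValueError("keyed and verified samples must be the same length")
    mismatches = Counter()
    for keyed_rec, verified_rec in zip(keyed, verified):
        for field in fields:
            if keyed_rec[field] != verified_rec[field]:
                mismatches[field] += 1
    total = len(keyed)
    # Guard against an empty sample so the report never divides by zero.
    return {f: (mismatches[f] / total if total else 0.0) for f in fields}


# Hypothetical two-record sample: one surname was keyed incorrectly.
keyed = [
    {"surname": "Garcia", "age": "34"},
    {"surname": "Lopes", "age": "20"},
]
verified = [
    {"surname": "Garcia", "age": "34"},
    {"surname": "Lopez", "age": "20"},
]
report = error_report(keyed, verified, ["surname", "age"])
```

A per-field breakdown of this kind would show which variables an operator most often gets wrong, which is the sort of information that could feed the additional instructions mentioned earlier.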
The deciphering of
handwriting that is almost two centuries old is neither automatic nor
impossible. Structured instruction, using photocopies of carefully
selected documents to be digitized, is the key to successful training.
The training procedure begins with numbers as they appear in the documents
because this accustoms trainees to the idea that transcription is like
a puzzle or breaking a code. Arabic numerals are easily taken for granted,
but even in the short span of a century and a half certain variation
has occurred in how individual digits are formed. With a couple of
hours of instruction, an educated assistant can be taught to recognize
and transcribe hand-written digits with a high degree of accuracy.
Alphabetic text introduces
additional problems: many more letters, literal use of abbreviations,
and a wider range of plausible contexts. Certain variables occurring
in the manuscript censuses will be transcribed literally by the key-operators,
namely given names and surnames, occupations, and titles. In the training,
the manuscript will be scanned to provide a sample of illustrative
texts. The sample materials will provide the basis for teaching the
literal digitizing of the manuscript censuses. To increase the assurance
with which data are captured, both coders and data entry personnel
will be instructed in the use of flags to signal uncertainty, certainty
(equivalent to "sic"), a torn or smudged original, and so on. After
the training is finished and as data-collection proceeds, daily verification
reports will enhance accuracy and perhaps speed. It is anticipated
that at least one coder and perhaps several will have had experience
with our data prior to the beginning of our project (having worked
on the 1821 and 1822 project).
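One way to picture the flags mentioned above is as short markers attached to a literal transcription. The sketch below is a guess at how such a scheme could work, not the GCP's actual convention: the marker characters (`?`, `!`, `#`) and the names `Flag`, `Entry`, `encode`, and `decode` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Flag(Enum):
    # Marker characters are assumptions chosen for illustration.
    UNCERTAIN = "?"   # operator is unsure of the reading
    SIC = "!"         # reading is certain, however odd it looks
    DAMAGED = "#"     # original is torn or smudged


@dataclass(frozen=True)
class Entry:
    text: str
    flags: frozenset = frozenset()


def encode(entry):
    """Append the flag markers to the literal text, in a stable order."""
    return entry.text + "".join(sorted(f.value for f in entry.flags))


def decode(raw):
    """Strip trailing flag markers and return the text with its flags."""
    by_marker = {f.value: f for f in Flag}
    flags = set()
    while raw and raw[-1] in by_marker:
        flags.add(by_marker[raw[-1]])
        raw = raw[:-1]
    return Entry(raw, frozenset(flags))


keyed = encode(Entry("Xavier", frozenset({Flag.SIC})))
recovered = decode(keyed)
```

A suffix scheme like this would misread any literal text that genuinely ends in a marker character, so a real system would more likely store flags in a separate field.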
Digitizing data.
With recent advances in microcomputer software, the actual operation
of the data collection should be the easiest portion of the workshop.
Collecting historical census data in literal form does not require
sophisticated collection software. Assistants with some prior experience
in the use of microcomputers should readily adjust to the needs of
this project. To ensure accuracy, speed, and data integrity, instructional
modules will be developed on keying data, correcting mistakes, saving
records, safe-guarding files, advanced "mouse" techniques,
and the like. We found from the experience of our 1821-22 project that
both coders and data entry personnel will come to the project with
most of these skills already in hand and will need little training
beyond how those skills fit into our project procedures.
These training procedures
are similar to those developed some years ago at the University of
Minnesota for the computerization of a 1% sample of the 1880 manuscript
census of the United States (research funded by various federal agencies)
and recently adapted for a five-year federally funded project to computerize
a sample of other federal historical censuses. The finished product
for the GCP, as for the projects carried out by the Historical Censuses Project
at the University of Minnesota, is a literal transcription of the census
data, computerized in a fixed-field format so that the user can read
or analyze the data with the greatest confidence that the data-collection
process kept errors to a minimum.
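As an illustration of what a fixed-field format means in practice, the following sketch writes and reads one record in which every variable occupies a fixed span of columns. The field names and widths are invented for the example; the source does not give the GCP's actual record layout.

```python
# Hypothetical layout: (field name, width in characters).
FIELDS = [
    ("surname", 20),
    ("given_name", 20),
    ("age", 3),
    ("occupation", 15),
]


def format_record(record):
    """Pad each value to its field width and join into one fixed-length line."""
    parts = []
    for name, width in FIELDS:
        value = str(record.get(name, ""))
        if len(value) > width:
            raise ValueError(f"{name!r} exceeds {width} characters: {value!r}")
        parts.append(value.ljust(width))
    return "".join(parts)


def parse_record(line):
    """Slice a fixed-length line back into fields by column position."""
    record = {}
    pos = 0
    for name, width in FIELDS:
        record[name] = line[pos:pos + width].rstrip()
        pos += width
    return record


sample = {
    "surname": "García",
    "given_name": "José María",
    "age": "34",
    "occupation": "labrador",
}
line = format_record(sample)
restored = parse_record(line)
```

Because every field starts at a known column, such a file can be read by position without delimiters, at the cost of having to reject or truncate values longer than the field.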