Error Detection & Verification Procedures
(From the old website)

Data verification is another area in which considerable changes were made in response to our "real time" experience. Verification really involves two separate issues. One is ensuring consistency in both data coding and data entry. The other is ensuring the accuracy of coding and data entry. The latter was complicated in our 1821-22 project by the fact that the original data was written by hand by over thirty different officials, whose handwriting, use of abbreviations, and other conventions vary widely. The physical condition of the documents themselves also varies widely. This problem will be compounded for the 1838-42 and 1850 censuses, because they were compiled by many more census takers, often one official for several blocks. This may increase the accuracy of the census data, but it also increases the variety of handwriting and conventions.

Initially we considered doing double entry for a series of selected variables for each cuartel. We experimented with a small cuartel of fewer than 500 cases to see how time-consuming double entry would be. The process is simple. Two people enter the same data. Then, using a process called "differencing" in EXCEL, one file is compared to the other, and any differences are noted in a third file, which records "true" if the entries are the same and "false" if they are not. A third person then checks each "false" against the actual census manuscript and determines which entry was right. All entries flagged "false" are corrected on the so-called "master" file, and we have as error-free a product as humans can devise.

The theory is fine, if clearly time-consuming. The reality was somewhat different. Eventually we figured out that the machine not only matched the exact word or code but demanded that the exact spacing be duplicated as well!
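
The differencing step and its sensitivity to spacing can be illustrated with a minimal sketch. This is written in Python purely for illustration; it is not the project's EXCEL procedure, and the function name `difference`, the `normalize` option, and the sample rows are all hypothetical. It assumes both keyings contain the same records in the same order.

```python
def difference(first_entry, second_entry, normalize=False):
    """Compare two keyings of the same records, cell by cell.

    Returns rows of booleans: True where the two entries agree and
    False where they differ (the "third file" described in the text).
    Assumes both inputs have the same number of rows and columns.
    """
    def clean(value):
        # Optionally trim leading and trailing spaces before comparing.
        return value.strip() if normalize else value

    return [
        [clean(a) == clean(b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first_entry, second_entry)
    ]


# Hypothetical sample rows: given name, surname, marital-status code.
entry_a = [["Francisco", "Gonsales", "2"], ["Francisca", "Bargas", "1"]]
entry_b = [["Francisco ", "Gonsales", "2"], ["Francisca", "Vargas", "1"]]

strict = difference(entry_a, entry_b)                   # exact comparison
trimmed = difference(entry_a, entry_b, normalize=True)  # spaces ignored

print(strict)
print(trimmed)
```

In the strict comparison the trailing space after "Francisco" is reported as a difference even though the name itself matches. With trimming, only the genuine disagreement (Bargas versus Vargas) remains to be checked against the manuscript.
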
One person absent-mindedly enters a space before or after the data and, although both entries are otherwise exactly the same, one gets a "false" reading. It was clear to us that the time consumed doing double entry would be prohibitive. We considered using double entry for only a small group of variables but rejected that compromise as wasteful of our labor resources, with relatively little benefit in accuracy. In the end we have settled on traditional sight and oral verification.

The first verification procedure is incidental. The data entry person enters the codes and the literal data. Here is the first accuracy check, as the data entry person catches a portion of the errors simply by being alert to anomalies. The most obvious involves the sex of the individual. In Spanish an individual's given name usually (but not always) is gender-specific. Francisco will be a male; Francisca a female. The word ending indicates the sex. The same would be true for marital status and ethnicity as well. However, both those variables were coded without regard to sex (to save keystrokes), eliminating the use of gender-specific endings except in the case of the doncella (young lady).

The second stage is a systematic review of all data entry, done in two phases. In the first phase a coder (or the Project Director) does a "sight review" of all literally entered data for one cuartel (names, occupations, birth places, titles, type of residence) and of sequential numbers such as household number and manuscript page. The reviewer compares a computer-generated hard copy of the data entry with the original manuscript page (not the enlarged copy), line for line. This is where spelling errors, typos, and missing cases are discovered. The reviewer is not the person who entered the data. Second, all numeric codes (marital status, ethnicity, social status, sex and age) on the hard copy are read aloud by one person, while a second person checks each against the coded version.
This procedure is fast, if tedious, and our standard is 1000 cases an hour. The original data entry person then makes the corrections.

The third stage comes when the separate cuartel files are "pulled" from EXCEL and merged into one SPSS file. In a process often known as "cleaning" the data, "frequencies" are run on each variable and printed out. The frequencies are reviewed to spot any anomalies, inconsistencies, or errors. For example, the list of all surnames is 123 pages long. Illegible or faded text causes problems in data entry, but the names are entered with the help of a standard source for Hispanic names [1]. Inevitably, however, mistakes occur. A bilingual data entry person or a bilingual coder checks for spelling errors introduced in the data entry process by reviewing each name, both given names and surnames.

At this stage, our principle is that there are no spelling errors in naming, because spelling in the nineteenth century was rather flexible in Spanish, as it was in English. One might spell Gonsales, say, with an s or with a z, or Bargas with either a B or a V. We copy the name exactly as it was written on the page. In the second phase we will standardize the names, but the original spelling will always be available in the archive file.

Here, too, the reviewer will check the standard reference texts for given names and surnames. If the name is clearly written (even if obscure) and is found among the roughly twelve thousand Hispanic surnames, for example, then the spelling stands. If there is doubt, induced by illegible handwriting, for example, then the reviewer will check the standard texts. If a name is found which appears to fit the portion of the name that is legible, the reviewer will do one of two things. If they are reasonably certain that they have found the correct name, they will record it as if there had been no problem.
However, if the reviewer still retains some doubt as to the accuracy of the name found in the text, that name will be entered in brackets, which are always used to indicate a GCP interpretation. If no name is found but the reviewer believes it is reasonable, if not certain, that the name as entered, or as corrected, is correct, brackets are also used. If a name is only partially legible, and the reviewer (or data entry person) has no good guess as to the name, the legible letters are entered, along with dots for missing letters, and a question mark is placed at the end.

Next the variables are checked through cross tabulation, using sex as a variable because many codes are gender-specific. For example, a table of sex by marital status found 53 male "doncellas", a term which, as you know, is reserved for young ladies. The data was sorted by marital status and sex, and the offending doncellas were checked against the original manuscript for errors of sex or marital status. In particular, all given names by sex were checked for unlikely feminine or masculine names, as established by a standard reference work on Hispanic given names [2]. We found that roughly half the errors lay in estado (marital status) and half in sex. But likely or not, if the sex of the individual could be verified by the letter endings of other variables, the name was kept as written. Overall some 350 pages of frequencies were verified in this fashion.

Finally, after the verification process is finished, a random sample of approximately 3000 records will be taken to establish the project error rate for the "Literal" phase. After the "Constructed" variables are completely entered, another random sample of 3000 records will be taken to establish the project error rate for that phase.

Rod Anderson

1. Lyman D. Platt, Hispanic Surnames and Family History (Baltimore: Genealogical Publishing Company, Inc., 1996).
2. Richard D. Woods, ed., Hispanic First Names: A Comprehensive Dictionary of 250 Years of Mexican-American Usage (Westport, CT: Greenwood Press, 1984).
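
The cross-tabulation check of sex by marital status described in the text can be sketched as follows. This is only an illustration in Python, not the project's SPSS procedure. The field names (`sex`, `estado`), the status labels, and the sample records are hypothetical, since the project's code book is not reproduced here.

```python
from collections import Counter

# Hypothetical set of marital-status values that should occur only
# with female records; the text names "doncella" as one such case.
FEMALE_ONLY_STATUSES = {"doncella"}


def crosstab(records):
    """Count records for each (sex, marital status) combination."""
    return Counter((r["sex"], r["estado"]) for r in records)


def suspect_records(records):
    """Return records whose marital status contradicts the recorded sex."""
    return [
        r for r in records
        if r["estado"] in FEMALE_ONLY_STATUSES and r["sex"] != "F"
    ]


# Hypothetical sample data for illustration only.
records = [
    {"id": 1, "name": "Francisca", "sex": "F", "estado": "doncella"},
    {"id": 2, "name": "Francisco", "sex": "M", "estado": "married"},
    {"id": 3, "name": "Maria", "sex": "M", "estado": "doncella"},
]

table = crosstab(records)
flagged = suspect_records(records)

print(table[("M", "doncella")])     # number of male "doncellas"
print([r["id"] for r in flagged])   # records to check against the manuscript
```

The sketch only flags the contradiction. As in the procedure described in the text, deciding whether the error lies in sex or in marital status still requires going back to the original manuscript.
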
© 2003 Florida State University, historyweb@fsu.edu