

Description of Database: An Introduction

The Guadalajara Censuses Project (GCP) database is a rectangular, "flat" file composed of 57,091 rows (cases or records) and 90 columns (variables). The rows are mainly individuals living in families, households or "group quarters" (convents, jails, etc.). Several hundred cases, however, are vacant houses, included because even the absence of residents is an aspect of urban life worth capturing. The variables are of two types: alphanumeric or "string" variables, and numeric or coded variables. String variables contain data that were entered into the database as written on the census manuscript, such as surnames and occupations. Numeric variables contain data or "values" that are coded for ease of data entry and analysis. Examples are marital status and "calidad" (ethnicity or race).

Further, the string and numeric variables are divided into two types of variables: "literal" and "constructed."

Literal Variables.      The "Literal" variables are of four types:

First are those that contain information which was written on the manuscript page by the census taker or his scribe. Examples of these are cuartel number, residence type, name, age, occupation, etc.

Second are literal variables which contain data created by the GCP staff for ease of searching, sorting or otherwise manipulating the statistical information. Examples of those variables are Master Index, Index, Manuscript Page, Household Number and Person Sequence Number (the sequence of all individuals in the household).

Third are those string variables which have been "paired" with a numeric coded variable to facilitate statistical analysis. The numeric variable of the pair is usually identified with a "2" at the end of the name. Examples of those types are Patria2 and Job2.

Fourth are the consolidated variables created by GCP staff for the convenience of our users. Examples of those variables are Age2, Estado2 (marital status), Calidad2 (race/ethnicity), Razaclas (race and class) and Birthplace (born "elsewhere" or born in Guadalajara).

Constructed Variables.      The "Constructed" variables are those variables created by the GCP staff from the literal data on the manuscript pages, but which were not specifically provided by the census takers. These are of several types:

First are those variables which can be directly inferred with a high degree of confidence from the data actually provided. Examples are sex (rarely actually stated but usually obvious from names, position in the household or the endings of other data such as occupation or marital status), race of spouse and the location of one's spouse, mother or father within the household.

A further example of this type of constructed variable would be the so-called "count" variables, which provide the number of servants, number of employed persons, number of males and females in the family and household, etc. These are household and/or family variables in which each member of the household/family receives the same numeric data.
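
As a minimal sketch of how such a "count" variable behaves, the Python below computes a servants-per-household count and writes the same value onto every member of that household. The field names (household, occupation, n_servants), the occupation label and the sample rows are hypothetical illustrations, not GCP variable names or data.

```python
from collections import Counter

# Hypothetical rows: one dict per person; field names are illustrative only.
people = [
    {"household": 1, "name": "Maria", "occupation": "sirviente"},
    {"household": 1, "name": "Jose", "occupation": "comerciante"},
    {"household": 1, "name": "Ana", "occupation": "sirviente"},
    {"household": 2, "name": "Luis", "occupation": "zapatero"},
]

def add_servant_count(rows, servant_label="sirviente"):
    """Attach a household-level servant count to every member of the household."""
    counts = Counter(
        row["household"] for row in rows if row["occupation"] == servant_label
    )
    # Every member of a household receives the same value; a household
    # with no servants gets 0 because Counter returns 0 for absent keys.
    return [dict(row, n_servants=counts[row["household"]]) for row in rows]

counted = add_servant_count(people)
```

Because the count is repeated on each row, an analysis by household should take one row per household rather than summing the column over individuals.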

More problematic are those variables which are based on interpretations of the data or on specific assumptions used by GCP staff in the process of coding and/or entering the data into the database. The GCP believes that the importance of the information justifies our interpretations and assumptions. Examples of such constructed variables are household and family structure, position in the household/family and a variety of migration variables. Such variables were created based, as far as possible, on specific, documented assumptions or procedures.

  First, we endeavored to select the more conservative of the various assumptions available.

  Second, we adhered to a set of rigidly maintained and documented assumptions governing our interpretations.

  Or, third, we provided data quality variables, called "flag" variables, to give the user a usable level of confidence.

Flag Variables.      The "flag" variables provide a specific measure of data quality which enables users to select the level of quality with which they prefer to work. The flag values essentially record the specific assumptions on which the coders based their interpretation, or they contain some other indication of a greater or lesser level of data quality. They are in essence the level of confidence one might expect from the data of each case. Statistical software packages generally provide an easy means to recode variables so that users can work only with those values with which they are comfortable. An example is the flag for migrant marital status (FMIGMAR). A "0" means that we are "absolutely" certain of our data because the years in residence were given by the census takers. A "1" means we are "very certain," as the migrating couple had children born elsewhere and children born in Guadalajara. A "2" means we are "somewhat certain," as the couple had children born elsewhere (but not in Guadalajara). A "3" means we are "uncertain," as the couple had only children born in Guadalajara. A detailed explanation of this and other flag variables is provided in this codebook, below.
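
The kind of selection described here might be sketched in Python as follows. Only the variable name FMIGMAR and its 0 to 3 scale come from the text; the sample cases, the id field, the use of None for a case without a flag, and the function name are assumptions made for illustration.

```python
# Hypothetical cases; only FMIGMAR and its 0-3 scale come from the codebook text.
cases = [
    {"id": 1, "FMIGMAR": 0},     # "absolutely" certain
    {"id": 2, "FMIGMAR": 1},     # "very certain"
    {"id": 3, "FMIGMAR": 2},     # "somewhat certain"
    {"id": 4, "FMIGMAR": 3},     # "uncertain"
    {"id": 5, "FMIGMAR": None},  # no flag recorded (an assumption of this sketch)
]

def select_by_flag(rows, flag, max_level):
    """Keep rows whose flag is at or below max_level (lower means more certain)."""
    return [
        row for row in rows
        if row.get(flag) is not None and row[flag] <= max_level
    ]

# Work only with cases coded "absolutely" or "very" certain.
confident = select_by_flag(cases, "FMIGMAR", max_level=1)
```

Raising max_level to 2 or 3 admits progressively less certain cases, which is the trade-off the flag is meant to leave in the user's hands.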

Other Assumptions and Interpretations.      There are other assumptions and interpretations for which "flags" are not available but which were based on specific, consistent data coding and data entry procedures. The most common were paleographic interpretations of illegible handwriting. If some letters were clear and others not, the latter were marked by dots. If the data entry operator (nearly always either fluent in, or a native speaker of, Spanish) believed they knew the intended word or name, they placed their assumption in brackets after the partially illegible word. Such brackets were maintained in the Archive File, but eliminated in the consolidated file. Interpretations of names were supported by reference to standard works on Hispanic names. Finally, all names and terms were reviewed by native speakers of Spanish in the process of data verification, and any anomalies checked against the original manuscript.1

In all cases, staff assumptions and procedures are outlined in the Codebook which follows. More extensive discussion can also be found in separate appendices to this codebook or online at the GCP web site. In sum, the GCP objective in coding and entering our data is to provide the user with as much useful information as possible. The principle followed in the inevitable issue of interpretation is, so far as practical, either to provide an accompanying "flag" variable or to follow consistent, and generally conservative, documented procedures.

Numeric and String Variables.

  Numeric Variables. Variables that are either numeric in nature (e.g. age) or are codes created by the GCP for convenience of data entry and/or statistical analysis (e.g. household number and sex). In the Sex variable, the data values are presented as a code (1 = male and 2 = female). Numeric variables are those whose values can be predicted, and hence coded, in advance.

  String (alphanumeric) Variables. Variables that are entered into the database in their original written form. The most obvious examples are names, occupation and birthplace, all variables which cannot be coded in advance. To facilitate data analysis, we have also created additional "paired" numeric variables for those string variables which can be analyzed statistically. Such variables are identified with a "2" following the original variable name (e.g. Job2).

Missing Data.      If the data for an individual (case) on any particular variable are not present and cannot be inferred logically, the value is recorded either as unknown or as missing.

A "missing" Value.      A missing value is one which will never be known in any particular cuartel, such as ethnicity (calidad) for those cuartels which did not provide that data. Numeric and string variables handle missing data in different ways.

  Numeric Values.      For numeric values missing for all individuals in a given cuartel, the cell for that variable contains a dot (.), called a "systems missing" code. Any statistical analysis will include the systems missing cases, but will not include them in the "valid percent" column.

  String Values.      Missing data in string variables is represented by a blank. A blank is a legitimate value in a string variable and will be included under the "valid percent" column. However, in the numeric version of that variable (e.g. Job2), the blank usually will be converted to a systems missing dot.
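
The difference between the overall percent and the "valid percent" can be sketched in Python as below. This is an illustration under stated assumptions, not the output of any particular statistical package: None stands in for the systems missing dot, and the sample values and the function name are hypothetical.

```python
from collections import Counter

def frequency_table(values, missing=None):
    """Return {value: (count, percent, valid_percent)}.

    valid_percent is None for the missing marker itself, which is
    excluded from the valid-percent denominator.
    """
    total = len(values)
    counts = Counter(values)
    valid_total = total - counts.get(missing, 0)
    table = {}
    for value, count in counts.items():
        percent = 100.0 * count / total
        valid = None if value == missing else 100.0 * count / valid_total
        table[value] = (count, percent, valid)
    return table

# Hypothetical coded variable; None plays the role of the missing dot.
numeric_table = frequency_table([1, 1, 2, None])

# Hypothetical string variable; a blank is an ordinary value, so nothing
# is excluded and percent equals valid percent.
string_table = frequency_table(["espanol", "espanol", "indio", ""])
```

In the numeric table the missing case is counted in the overall percent but left out of the valid percent, while in the string table the blank takes a share of both columns.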

An "unknown" Value.      An unknown value is one which cannot be determined from the information provided in the census manuscript for that individual, even though that information was generally provided for other individuals in the cuartel. This situation is common, for example, in marital status.

  Numeric Values.      Usually unknown numeric values will be coded a "0" with a value label of "unknown," "unable to determine" or "unclassifiable," depending on the circumstances. Occasionally, a "0" will represent missing data both in cuartels where that information is normally provided and in cuartels where no data is provided. Usually this is in the numeric version of a string variable (e.g. "Restype2").

  String Values.      Unknown values in string variables are usually written as "unknown."

An "assumed" Value.      If a value is missing but can be inferred from the data, it is given an "assumed" code. For example, children of parents listed as Spanish will be given an assumed Spanish code. See "Marital Status" below for our rules for providing "assumed" values in that variable.

Illegible Values.      Finally, data which were provided but which could not be read because the ink is too faded, the handwriting is illegible, etc. are usually represented by a 98 or a 998 in numeric variables and by "illegible" in string variables.
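
The numeric conventions described in this section could be summarized in a small Python helper such as the sketch below. It assumes that None stands in for the systems missing dot, that 0 means "unknown" and that 98 or 998 means "illegible". The text says only that these codes are the usual ones, so the defaults should be checked against the codebook entry for each variable (a variable in which 0 or 98 is a real value would need different arguments). The function name is hypothetical.

```python
def classify_numeric(value, unknown_code=0, illegible_codes=(98, 998)):
    """Label one numeric cell as 'missing', 'unknown', 'illegible' or 'valid'."""
    if value is None:  # None stands in for the systems missing dot
        return "missing"
    if value == unknown_code:
        return "unknown"
    if value in illegible_codes:
        return "illegible"
    return "valid"

# Hypothetical cell values for a single coded variable.
statuses = [classify_numeric(v) for v in [None, 0, 98, 3]]
```

Passing illegible_codes=(998,) treats 98 as an ordinary value for a variable where it can legitimately occur.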


1. The issue of data quality is an ongoing discussion among researchers. For example, see Sean Townsend, Cressida Chappell and Oscar Struijvé, Digitising History, online at http://hds.essex.ac.uk/g2gp/digitising_history . We have, as far as possible and practical, followed procedures established or refined by the Integrated Public Use Microdata Series (IPUMS) at the University of Minnesota Historical Census Project, Steven Ruggles director. See Matthew Sobek and Steven Ruggles, "The IPUMS Project: An Update," Historical Methods (Summer 1999), vol. 32:3, pp. 102-09, and earlier publications in the same source.


© 2003 Florida State University, historyweb@fsu.edu