Description of Database: An Introduction
The GCP database is a rectangular, "flat" file composed of 57,091
rows (cases or records) and 90 columns (variables). The rows are mainly
individuals living in families, households or "group quarters" (convents, jails,
etc.). Several hundred cases, however, are vacant houses, included because even
the absence of residents is an aspect of urban life worth capturing. The variables
are of two types: alphanumeric or "string" variables, and numeric or coded
variables. String variables contain data which were entered into the database
as written on the census manuscript, such as surnames and occupations. Numeric
variables contain data or "values" which are coded for ease of
data entry and analysis. Examples are marital status and "calidad" (ethnicity).
Further, the string and numeric variables are divided into two types of variables: "literal" and "constructed."
Literal Variables. The "literal" variables are of several types.
The first are those that contain information which was written on the manuscript
page by the census taker or his scribe. Examples of these
are cuartel number, residence type, name, age, occupation, etc.
The second are literal variables which contain data created by the GCP staff
for ease of searching, sorting or otherwise manipulating the statistical
information. Examples of those variables are Master Index, Index,
Manuscript Page, Household Number and Person Sequence Number (the sequence
of all individuals in the household).
The third are those string variables which have been "paired" with
a numeric coded variable to facilitate statistical analysis. The numeric member
of the pair is usually identified with a "2" at the end of the name.
Examples of those types are Patria2 and Job2.
The fourth are the consolidated variables created by GCP staff for the convenience
of our users. Examples of those variables are Age2, Estado2 (marital
status), Calidad2 (race/ethnicity), Razaclas (race and class) and
Birthplace (born "elsewhere" or born in Guadalajara).
Constructed Variables. The "constructed" variables
are those variables created by the GCP staff from the literal data
on the manuscript pages, but which were not specifically provided by the census takers.
These are of several types.
The first are those variables which can be directly inferred with a high
degree of confidence from the data actually provided. Examples
are sex (rarely actually stated but usually obvious from names,
position in the household or the endings of other data such as
occupation or marital status), race of spouse and the location
of one's spouse, mother or father within the household.
A further example of this type of constructed variable would be the
so-called "count" variables, which provide the number of servants,
number of employed persons, number of males and females in the family and household,
etc. These are household and/or family variables in which each member of the
household/family receives the same numeric data.
More problematic are those variables which are based on interpretations
of the data or specific assumptions used by GCP staff in the process
of coding and/or entering the data into the database. The GCP believes
that the importance of the information justifies our interpretations
and assumptions. Examples of such constructed variables are household
and family structure, position in the household/family and a variety
of migration variables. Such variables were created based, as far
as possible, on specific, documented assumptions or procedures.
First, we endeavored to select the more conservative of the various assumptions.
Second, we adhered to a set of rigidly maintained and documented assumptions
governing our interpretations.
Third, we provided data quality variables called "flag" variables
to give the user a usable level of confidence.
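As an illustration of the "count" variables described above, the following Python sketch computes the number of males and females per household and attaches the same count to every member of that household. The field names (household, n_males, n_females) and the sample records are hypothetical and are not the GCP's actual variable names; only the sex codes (1 = male, 2 = female) follow this codebook.

```python
from collections import Counter

# Hypothetical records: field names and values are illustrative only.
# Sex codes follow the codebook: 1 = male, 2 = female.
people = [
    {"household": 1, "name": "Maria", "sex": 2},
    {"household": 1, "name": "Jose", "sex": 1},
    {"household": 1, "name": "Ana", "sex": 2},
    {"household": 2, "name": "Pedro", "sex": 1},
]


def add_household_counts(records):
    """Return copies of the records with per-household counts attached.

    Every member of a household receives the same n_males and n_females
    values, which is the defining property of a "count" variable.
    """
    males = Counter(r["household"] for r in records if r["sex"] == 1)
    females = Counter(r["household"] for r in records if r["sex"] == 2)
    return [
        {
            **r,
            "n_males": males[r["household"]],      # Counter gives 0 if absent
            "n_females": females[r["household"]],
        }
        for r in records
    ]


counted = add_household_counts(people)
```

Because the counts are stored on every row, a user can select, for instance, all individuals living in households with no adult males without first aggregating the file.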
Flag Variables. The "flag" variables provide a specific measure
of data quality which will enable users to select the level of quality
with which they prefer to work. The flags essentially record
the specific assumptions on which the coders based their interpretation,
or they contain some other indication of a greater or lesser level of
data quality. They are in essence the level of confidence one might
expect from the data of each case. Statistical software packages generally
provide an easy means to recode variables to allow the user to work
only with those values with which they are comfortable. An example
is the flag for migrant marital status (FMIGMAR). A "0" means
that we are "absolutely" certain of our data because the
years in residence were given by the census takers. A "1" means
we are "very certain," as the migrating couple had children
born elsewhere and children born in Guadalajara. A "2" means
we are "somewhat certain," as the couple had children born
elsewhere (but not in Guadalajara). A "3" means we are "uncertain," as
the couple had only children born in Guadalajara. A detailed explanation
of this and other flag variables is provided in this codebook, below.
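The selection by flag value described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample cases and the "id" field are hypothetical, and the helper select_by_flag is not part of any GCP tool. Only the meaning of the FMIGMAR codes 0 to 3 is taken from the text.

```python
# Hypothetical cases; only the FMIGMAR codes (0-3) come from the codebook text.
cases = [
    {"id": 1, "FMIGMAR": 0},  # "absolutely" certain
    {"id": 2, "FMIGMAR": 1},  # "very certain"
    {"id": 3, "FMIGMAR": 2},  # "somewhat certain"
    {"id": 4, "FMIGMAR": 3},  # "uncertain"
    {"id": 5, "FMIGMAR": 1},
]


def select_by_flag(records, flag, max_level):
    """Keep records whose flag value is at or below max_level.

    Lower flag values indicate greater certainty, so max_level is the
    least certain level the user is still willing to work with.
    """
    return [r for r in records if r[flag] <= max_level]


# Work only with cases coded "absolutely" or "very" certain.
confident = select_by_flag(cases, "FMIGMAR", 1)
```

A user who is comfortable with "somewhat certain" data would pass 2 instead of 1. How cases to which the flag does not apply are coded is not stated here and should be checked in the detailed flag documentation before filtering.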
Other Assumptions and Interpretations. There
are other assumptions and interpretations for which "flags" are
not available but which were based on specific, consistent data coding
and data entry procedures.
The most common were paleographic interpretations of illegible handwriting.
If some letters were clear and others not, the latter were marked by
dots. If the data entry operator (nearly always either fluent in, or
a native speaker of, Spanish) believed they knew the intended word
or name, they placed their assumption in brackets after the partially
illegible word. Such brackets were maintained in the Archive File,
but eliminated in the consolidated file. Interpretations of names were
supported by reference to standard works on Hispanic names. Finally,
all names and terms were reviewed by native speakers of Spanish in
the process of data verification, and any anomalies checked against
the original manuscript.1
In all cases, staff assumptions and procedures are outlined in the
Codebook which follows. More extensive discussion can also be found
in the appendices to this codebook or online at the GCP web site. In sum, the
GCP objective in coding and entering our data is to provide the user
with as much useful information as possible. The principle followed in
the inevitable issue of interpretation is, so far as practical, either to
provide an accompanying "flag" variable or to follow consistent,
and generally conservative, documented procedures.
Numeric and String Variables.
Numeric Variables. Variables that are either numeric in nature (e.g. age)
or are codes created by the GCP for convenience of data entry and/or
statistical analysis (e.g. household number and sex). In the Sex
variable, the data values are presented as a code (Sex, 1=male
and 2 = female). Numeric variables are those which can be predicted
Missing Data. If
an individual's (case) data for any particular variable is not present
and cannot be inferred logically, it is either recorded
as unknown or as missing.
String (alphanumeric) Variables. Variables that are entered into the
database in their original written form. The most obvious examples
are names, occupation and birthplace, all variables which cannot
be coded in advance. To facilitate data analysis, we also have
created additional "paired" numeric variables for those
string variables which can be analyzed statistically. Such variables
are identified with a "2" following the original variable
name (e.g. Job2).
A "missing" Value.
A missing value is one which will never be known in any particular
cuartel, such as ethnicity (calidad) for those cuartels which did
not provide that data. Numeric and string variables handle missing
data in different ways.
Numeric Values. For numeric values missing for all individuals in a given
cuartel, the cell for that variable contains a dot (.), called the
"systems missing" code. Any statistical analysis will count the systems missing
cases, but will not include them in the "valid percent" column.
String Values. Missing data in string variables is represented by a blank.
A blank is a legitimate value in a string variable and will be included in
the "valid percent" column. However, in the numeric version of that
variable (e.g. Job2), the blank usually will be converted to the systems missing code.
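The difference between the overall percentage and the "valid percent" can be illustrated with a small Python sketch. It is not the output of any particular statistical package: the function frequency_table is hypothetical, None stands in for the systems missing dot (.), and the sample codes are invented.

```python
from collections import Counter


def frequency_table(values):
    """Return {value: (count, percent, valid_percent)}.

    None stands for a systems missing cell. Missing cells are counted and
    receive an overall percent, but they are excluded from the base of the
    valid percent and have no valid percent of their own.
    """
    counts = Counter(values)
    total = len(values)
    valid_total = total - counts.get(None, 0)
    table = {}
    for value, n in counts.items():
        percent = 100.0 * n / total
        valid = None if value is None else 100.0 * n / valid_total
        table[value] = (n, percent, valid)
    return table


# Hypothetical codes for a numeric variable; None plays the role of the dot (.)
values = [1, 1, 2, None, None, 1, 2, 1]
table = frequency_table(values)
```

Here code 1 accounts for 4 of 8 cases overall but 4 of the 6 valid cases. A blank in a string variable would, by contrast, be an ordinary value (an empty string in this sketch) and would stay in the valid base.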
An "unknown" Value. An unknown value is one which cannot be determined from the information
provided in the census manuscript for that individual, even though
generally that information was provided for other individuals in
the cuartel, a situation common, for example, in marital status.
Numeric Values. Usually unknown numeric values will be coded "0" with
a value label of "unknown," "unable to determine" or "unclassifiable," depending
on the circumstances. Occasionally, a "0" will represent
both missing data in cuartels where that information is normally
provided and data in cuartels where none is provided. Usually this
occurs in the numeric version of a string
variable (e.g. "Restype2").
String Values. Unknown values in string variables are usually written
An "assumed" Value.
If a value is missing but can be inferred from the data then it
gets an "assumed" code. For example, children of parents
listed as Spanish will be given an assumed Spanish code. See "Marital
Status" below for our rules for providing "assumed" values.
Illegible Values. Finally, data which were provided but which could not be
read because the ink is too faded, the handwriting is illegible, etc. are usually
represented by a 98 or a 998 for numeric variables and "illegible" in string variables.
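The conventions above for numeric variables can be sketched as a small Python classifier. This is an illustration under assumptions, not the GCP's own procedure: the function names are hypothetical, None stands in for the systems missing dot (.), and whether 0, 98 or 998 is reserved in a given variable must be checked in the codebook entry for that variable (in a variable such as age, these numbers could be real values).

```python
# Codes taken from the text: 0 = unknown, 98 or 998 = illegible.
# Treating None as systems missing is an assumption of this sketch.
UNKNOWN_CODE = 0
ILLEGIBLE_CODES = {98, 998}


def classify(value):
    """Label a numeric cell as 'missing', 'unknown', 'illegible' or 'valid'."""
    if value is None:
        return "missing"
    if value == UNKNOWN_CODE:
        return "unknown"
    if value in ILLEGIBLE_CODES:
        return "illegible"
    return "valid"


def usable(values):
    """Keep only cells that carry a substantive code."""
    return [v for v in values if classify(v) == "valid"]


# Hypothetical column of codes for one variable.
sample = [3, 0, None, 98, 1, 998]
kept = usable(sample)
```

Keeping the four categories distinct, rather than collapsing them into a single "no data" value, preserves the distinction the codebook draws between information never collected, information not determinable for one person, and information written but unreadable.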
1. The issue of data quality is an ongoing discussion
among researchers. For example, see Sean Townsend, Cressida Chappell
and Oscar Struijvé,
Digitising History, online at http://hds.essex.ac.uk/g2gp/digitising_history . We have, as far as possible and practical, followed procedures established
or refined by the Integrated Public Use Microdata Series (IPUMS) at the
University of Minnesota Historical Census Project, Steven Ruggles director.
See Matthew Sobek and Steven Ruggles, "The IPUMS Project: An Update," Historical
Methods (Summer 1999), vol. 32:3, pp. 102-09, and earlier publications
in the same source.