GENEALOGY
Tips & Tricks
SOUNDEX SYSTEM
(A short guide to the U.S. (English) Soundex Code System)
Introduction to the Soundex Coding System
During your genealogical research you will occasionally need to depend on the
Soundex System. The most prominent use of the Soundex System has been as the
method for indexing names in the 1880, 1900, 1910, 1920 and
1930 U.S. Population Censuses. Soundex can also aid genealogists in many other
situations by identifying the spelling variations for a given surname.
Nothing is more frustrating to a
genealogist than to look through an alphabetic index of records for a particular
surname and not be able to locate what is believed to be there, only to find out
later, perhaps years later, that the data was there, but the entry was
misspelled or spelled quite differently. And how do you locate the towns of
immigrant, non-English-speaking ancestors when the only information available was
passed down orally through the generations and, no matter how you try to spell
it, you cannot find the town on a map?
A major solution to these indexing problems was provided many years ago when
Robert C. Russell of Pittsburgh, Pennsylvania was issued patent number 1,261,167
on April 2, 1918. Russell's patent was granted for having "invented
certain new and useful Improvements in Indexes ... as will enable others
skilled in the art to which it appertains to make and use the same." With
Russell's patent, the idea of indexing information by how it sounds rather than
alphabetically was born. Russell's patented indexing system has become
known simply as: 'soundexing'.
Soundexing is a hashing system for English words. The Soundex Code for an
English word (or name) consists of a letter followed by three numbers
that together roughly describe how the word sounds.
Similar sounding words will have the same or similar codes.
The use of the code in a filing (and retrieving) system keeps together words of
the same and similar sounds that have variant spellings. The Soundex Code is often used by 411 systems (phone information), and by some
state driver's license databases for looking up alternate spellings of a last
name. The Soundex Code has been used by the United States Census Bureau for finding similar
names in census records. Some genealogical database computer programs and
some genealogical information Web sites [including ancestry.com and
familysearch.org (LDS)] use a Soundex System for people searches.
More of the digital genealogical tools will use some sort of soundexing in the
future.
In Russell's patent, he
said, "There are certain sounds which form the nucleus of the English language,
and those sounds are inadequately represented merely by the letters of the
alphabet, as one sound may sometimes be represented by more than one letter or
combination of letters, and one letter or combination of letters may represent
two or more sounds. Because of this, a great many names may have two or
more different spellings which in an alphabetic index, or an index which
separates names according to the sequence of their contained letters in the
alphabet, necessitates their filing in widely separate places."
Russell knew that the letters of the alphabet were divided, phonetically, into
distinct categories. To each category, he assigned a numeric value.
His patent describes the sound categories of the letters in the alphabet as
follows:
- The vowels (Russell called them 'oral resonants'): a, e, i, o, u, y
- The labials and labio-dentals: b, f, p, v
- The gutturals and sibilants: c, g, k, q, s, x, z
- The dental-mutes: d, t
- The palatal-fricative: l
- The labio-nasal: m
- The dento- or lingua-nasal: n
- The dental fricative: r
Through the years, Russell's system has been improved upon. Those
familiar with the current Soundex System used by the U.S. government will see
that the original system was changed by combining the letters 'm' and 'n',
dropping vowels altogether unless one is the initial letter of the word, and
dropping the rule regarding 'gh' and words that end with 's' or 'z'.
Definitions of Speech Sounds
- Dentals - speech sounds articulated with the tongue tip touching the back of
  the upper front teeth or immediately above them.
- Fricative (also called spirant) - speech sounds characterized by audible
  friction produced by forcing the breath through a constricted or partially
  obstructed passage in the vocal tract.
- Gutturals - speech sounds articulated in the back of the mouth or throat.
- Labials - speech sounds articulated primarily by the lips.
- Nasals - speech sounds pronounced with the voice issuing through the nose,
  either partly, as in French nasal vowels, or entirely, as in the letters 'm'
  and 'n' or the 'ng' in song.
- Palatals - speech sounds articulated with the blade of the tongue held close
  to or touching the hard palate.
- Resonant - a vowel, or a voiced consonant or semivowel that is neither a
  stop nor an affricate (a speech sound released slowly, as in either of
  the 'ch' sounds in church), as in 'l', 'm', 'ng', 'n', 'r', 'w' and 'y'.
- Sibilants - speech sounds characterized by a hissing sound, like those
  spelled with 's' in the words this, rose, pressure, etc., and
  similar uses of 'ch', 'sh', 'z', 'zh', etc.
- Vowels - speech sounds produced without occluding, diverting, or obstructing
  the flow of air from the lungs.
Using the U.S. (English) Soundex Code System
The first character of the Soundex Code is simply the first
letter of the word. The remaining numbers, ranging from 1 to 6,
identify the categories of sounds created by the consonants that follow the
first letter of the word. If the word is too short to generate 3 numbers,
0s are added as needed. If the generated code is longer than 3 numbers,
the extra numbers are thrown away.
| CODE | LETTERS | DESCRIPTION |
|---|---|---|
| 1 | B, F, P, V | Labials |
| 2 | C, G, J, K, Q, S, X, Z | Gutturals & Sibilants |
| 3 | D, T | Dental |
| 4 | L | Long Liquid |
| 5 | M, N | Nasal |
| 6 | R | Short Liquid |
| SKIPPED | A, E, H, I, O, U, W, Y | Vowels plus H, W, & Y |
There are a number of special cases when calculating a Soundex Code:
- Letters with the same Soundex number that are immediately
  next to each other are coded only once. So Pfizer becomes Pizer (coded as
  P-260), Lloyd becomes Loyd (coded as L-300), Sack
  becomes Sac (coded as S-200), Czar becomes Car (coded as C-600), Kelly becomes
  Kely (coded as K-400), and Schaefer becomes Shaefer (coded as S-160).
- The letter combination 'gh', and the letters 's' or 'z', may be discarded if
  they are at the end of the word.
- The letters 'h' and 'w' are completely disregarded except
  as the initial letter of the name.
- If two letters with the same Soundex number are separated
  by 'h' or 'w', the code uses only the first letter. So Ashcroft
  is treated as Ashroft (coded as A-261).
- Surname prefixes such as van, von, Di, de, le, D', dela,
  or du are sometimes disregarded in alphabetizing and in coding (try such
  names both ways).
- Mc and Mac are NOT considered prefixes.
Note: NARA reminds us that not all Bureau of the Census
employees involved with Soundexing names strictly followed the rules. For
example, Ashcroft should have been coded as A-261, but may have been coded as A-226
in some cases.
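The coding table and special cases above can be sketched in Python. This is a minimal sketch, not an official implementation: the function name `soundex`, the `GROUPS` table layout and the `hw_transparent` flag are illustrative assumptions. It ignores any character outside A-Z and does not apply the end-of-word or surname-prefix rules. Setting `hw_transparent=False` treats H and W like vowels, which reproduces the A-226 variant mentioned in the NARA note.

```python
# Letter groups from the coding table; every other letter is skipped.
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex(name: str, hw_transparent: bool = True) -> str:
    """Return a code such as 'A-261' for a surname.

    hw_transparent=True: H and W do not separate letters that share a number.
    hw_transparent=False: H and W separate them, just as vowels do.
    """
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The initial letter is kept as a letter, but its number still
    # suppresses an immediately following letter with the same number.
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != previous:
                digits.append(code)
            previous = code
        elif ch in "HW" and hw_transparent:
            continue  # keep 'previous' so a repeated number is still dropped
        else:
            previous = ""  # a skipped letter separates repeated numbers
    # Truncate to three numbers, or pad with zeros if there are fewer.
    return first + "-" + "".join(digits)[:3].ljust(3, "0")


if __name__ == "__main__":
    for surname in ("Pfizer", "Lloyd", "Schaefer", "Ashcroft"):
        print(surname, soundex(surname))
    print("Ashcroft (H as separator)", soundex("Ashcroft", hw_transparent=False))
```

With the default setting, Pfizer, Lloyd, Schaefer and Ashcroft come out as P-260, L-300, S-160 and A-261, matching the special cases listed above.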
Examples
| NAME | LETTERS | CODE |
|---|---|---|
| Beha | B - | B - 000 |
| Bicksler | B - clr | B - 246 |
| Johnson | J - nsn | J - 525 |
| Knoles * | K - nls | K - 542 |
| Knowles * | K - nls | K - 542 |
| Marvel | M - rvl | M - 614 |
| Noel | N - l | N - 400 |
| Noles * | N - ls | N - 420 |
| Prettyman ** | P - rtm | P - 635 |
| White | W - t | W - 300 |
* Note the code for Knowles & Noles is quite
different (illustrating one of the problems with soundexing).
** Prettyman shares the same code with: Parten, Portwine, Pridemore, Prudden &
Purdom (also illustrating one of the problems with soundexing).
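To make the two footnotes concrete, here is a small sketch of how a Soundex index files names under their codes. The names and codes are copied from the examples table and footnotes above; the `build_index` helper is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# (name, code) pairs taken from the examples table and its footnotes.
CODED_NAMES = [
    ("Knoles", "K-542"), ("Knowles", "K-542"), ("Noles", "N-420"),
    ("Prettyman", "P-635"), ("Parten", "P-635"), ("Portwine", "P-635"),
    ("Pridemore", "P-635"), ("Prudden", "P-635"), ("Purdom", "P-635"),
]


def build_index(pairs):
    """Group names under their Soundex code, as a card index would."""
    index = defaultdict(list)
    for name, code in pairs:
        index[code].append(name)
    return dict(index)


index = build_index(CODED_NAMES)

if __name__ == "__main__":
    for code in sorted(index):
        print(code, ", ".join(index[code]))
```

Knowles and Noles land in different groups even though they sound alike, while six unrelated surnames share the P-635 group.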
Soundex Code Calculator
Use the Soundex Code calculator below
(provided by RootsWeb) to determine the code for your surnames of interest (if
calculating the code by hand is not your thing). This Soundex search form will return the Soundex Code for the surname
entered, plus identify other
surnames or spellings sharing the same Soundex Code.
U.S. Population Census Use of Soundex
(per the National Archives and Records Administration)
1880
- The 1880 Census is indexed only for those families with children
aged 10 years or younger.
1890
- A Department of Commerce fire in 1921 destroyed most of the 1890
Census. Although there is no Soundex, there is an alphabetical
index for the small percentage of population schedules that survived the
fire.
1900
- There is a Soundex Index for all states.
1910
- There is a Soundex Index for only the following states:
| Alabama | Kentucky | Oklahoma |
| Arkansas | Louisiana | Pennsylvania |
| California | Michigan | South Carolina |
| Florida | Mississippi | Tennessee |
| Georgia | Missouri | Texas |
| Illinois | North Carolina | Virginia |
| Kansas | Ohio | West Virginia |
1920
- There is a Soundex Index for all states.
1930
- There is a Soundex Index for only the following states:
| Alabama | Kentucky (part) * | South Carolina |
| Arkansas | Louisiana | Tennessee |
| Florida | Mississippi | Virginia |
| Georgia | North Carolina | West Virginia (part) ** |
* These Kentucky counties are soundexed:
Bell, Floyd, Harlan, Kenton, Muhlenberg, Perry and Pike.
** These West Virginia counties are soundexed:
Fayette, Harrison, Kanawha, Logan, McDowell, Mercer and Raleigh.
1940
and Later
- These Censuses are not yet available to the public because of
legislation requiring a 72-year delay in their release. In any
event, there are no Soundex Indexes after 1930. The Soundexing
project was a WPA Project.
Problems with the U.S. (English) Soundex Code System
The term “Soundex” is a generic term that covers many
variations of an algorithm for categorizing (not searching) names that was first
patented in 1918. Most variations work the same in that they convert the
name into a code, or 'key', consisting of the first letter, followed by several
numbers that are assigned based upon a pre-determined grouping of consonants.
The name Soundex implies that the algorithm is a highly accurate phonetic
matching algorithm, which is not the case at all. Consider the results of
these studies:
- Only 33% of the matches that would be returned by
  Soundex would be correct. Even more significant was the finding
  that fully 25% of correct matches would fail to be discovered by
  Soundex. (Alan Stanier, September 1990, Computers in Genealogy,
  Vol. 3, No. 7)
- Only 36.37% of Soundex returns were correct, while
  more than 60% of correct names were never returned by Soundex. (A.J.
  Lait and B. Randell, 1996)
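The two figures quoted from each study correspond to what are usually called precision (the share of returned names that are correct) and recall (the share of correct names that are returned). The sketch below shows how those two ratios are computed. It assumes the studies measured them this way, and the sample search is hypothetical: the "returned" names all code to K-542, while the "relevant" set is an illustrative judgment of which names are true variants of Knowles.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one search."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search for "Knowles": names filed under K-542 ...
returned = ["Knowles", "Knoles", "Kennels", "Kimmels"]
# ... versus the spellings a researcher would actually want to see.
relevant = ["Knowles", "Knoles", "Noles", "Nowles"]

if __name__ == "__main__":
    precision, recall = precision_recall(returned, relevant)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In these terms, the first study reports a precision of about 0.33 and a recall of about 0.75.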
A word of warning about Soundex Codes and their use. In theory the
code should always be
the same for a given name; however, in actual practice codes sometimes vary. There are a
number of reasons for this. Sometimes implementations of the algorithm have bugs
that only become apparent in a small number of cases (there are a number
of implementations with bugs). Sometimes last names are entered into a system
or database incorrectly. In addition, the Soundex System is essentially English-oriented.
There is no support for characters beyond the 26 letters used in the English
language. As a result, names with other letters (like ć, ř, or Đ) are
sometimes encoded in different ways by different people and programs.
Are you considering using Soundex for anything important? You might want to
think again. Soundex is actually a pretty poor algorithm for doing fuzzy name
comparisons. Newer, albeit more complicated, coding systems may be needed
for your application.
The ten (10) major problems with the English based Soundex
System as implemented in the U.S. (and with other key-based name match solutions)
that you might encounter when looking for genealogical records are:
1. Poor Precision producing MANY FALSE POSITIVES!
- False positives are the
nemesis of any name searching application because for each legitimate match
candidate, the user must sift through many useless ones. Soundex and other
key-based algorithms are well known for their generation of an intolerably
high percentage of false positives.
2. Dependence on Initial Letter
- Key-based algorithms rely on the initial
letter to generate the key. Someone looking for Korbin may enter Corbin; someone looking for Kreighton may enter Creiton;
and someone looking for Noles may enter Knowles. Think of all of the matches
that will never be found.
3. Poor Handling of Multi-Cultural Names
- Will your research or application need to
process ethnically diverse names? If so, you will likely be very
disappointed with the results from key-based approaches. Searching
ethnically diverse names is a multi-dimensional problem that must be
addressed differently depending upon the type of name being handled.
Treating every name as a string of characters from left to right is a flawed
approach.
4. Unranked, Unordered Returns
- Key-based approaches such as Soundex do not
have the ability to intelligently rank a set of candidate matches or to
measure the degree of similarity between a pair of names. Some Soundex
variations have simplistic algorithms that attempt to rank similarity based
upon such things as the number of matched characters, but tests have shown
these to be extremely flawed for name searching.
5. Poor Handling of Name Syntax Variation
- The “first-middle-last” database
model used in most databases can cause syntax issues for name searching,
especially with ethnically diverse names. Name components may appear in the
wrong fields and, if more than three name components exist, they may be
concatenated into a single field. Key-based approaches do not do an adequate
job of handling these issues.
6. Poor Handling of Name Equivalence
- Many names have related forms that
cannot be searched using a compression algorithm or fuzzy match logic. A
simple example is Peggy and Margaret. Some Soundex variations have been
enhanced with some rudimentary name tables but none have the breadth of
coverage necessary to do an adequate job of processing ethnically diverse
names.
7. Noise Intolerance
- Keying errors and other noise problems are
unavoidable when processing name data. However, because there is no
“correct” spelling of names, in general, they are resistant to
data-validation and data-cleansing techniques. If a database record contains
the name “Msith”, key-based approaches such as Soundex are helpless at
identifying this as a potential variation for “Smith”.
8. Poor Handling of Name Particles
- Many cultures have name particles that
wreak havoc on Anglo-centric and key-based search algorithms. Abdul Rajeeb
may also be represented as Abdel Raqeeb, Abd Al Ragib, Abderaqib, Abdurra’ib,
and many other ways. Including a name particle such as Abdul can be
disastrous with key-based approaches.
9. Poor Handling of Phonetic Variations
- Despite having the name Soundex,
these types of approaches perform poorly on names with true phonetic
variations such as Leighton / Layton, Phifer / Pheiffer / Fifer,
and Coghburn / Coburn.
10. Poor Handling of Abbreviations and Initials
- Real world applications,
especially those that contain name data transcribed from written documents,
are rife with abbreviated names and initials. Records for James Peter Jones
and J.P. Jones may well be the same person, but will not be discovered by
most search algorithms. If you intend to process ethnically-diverse data,
consider that most Anglo-centric data entry and data query professionals
would know how to handle the “MD” in John R. Smith, MD. But would they know
how to handle Md. Abdul Rahman? Md. is a common abbreviation for Mohammed. This is
just one of thousands of examples across many cultures.
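Problems 4 and 7 in the list above come down to the fact that a key-based lookup only answers "same key" or "different key". As a contrast, and not as part of any Soundex system, the sketch below ranks candidates by a similarity score using `difflib.SequenceMatcher` from the Python standard library. The `rank_candidates` helper and the candidate list are hypothetical, and this general-purpose string measure is only a stand-in for the purpose-built name matching that the text calls for.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates):
    """Return (score, name) pairs, most similar to the query first."""
    def score(name):
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()
    return sorted(((score(name), name) for name in candidates), reverse=True)


# Hypothetical database entries, including the keying error "Msith".
ranked = rank_candidates("Smith", ["Jones", "Msith", "Smythe"])

if __name__ == "__main__":
    for similarity, name in ranked:
        print(f"{similarity:.2f} {name}")
```

Here "Msith" ranks first for the query "Smith", although a Soundex search on S-530 would never reach it because its code begins with M.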
The Daitch-Mokotoff Soundex System
The latest significant improvement to soundexing is the
Daitch-Mokotoff soundex system. In 1985, Gary Mokotoff indexed the
names of some 28,000 persons who legally changed their names while living in
Palestine from 1921 to 1948, most of whom were Jews with Germanic or Slavic
surnames. Mokotoff discovered there were numerous spelling variants of
the same basic surname and that the list needed to be soundexed. Using
the conventional U.S. government system, which is based on the Russell
system, many Eastern European Jewish names which sound the same did not
soundex the same. The most prevalent were those names spelled
interchangeably with the letter w or v, for example, the names Moskowitz and
Moskovitz.
A modification to the U.S. Soundex system was then created and published in the
first issue of Avotaynu, the journal of Jewish genealogy, in an
article titled "Proposal for a Jewish Soundex Code." Randy
Daitch read the Mokotoff article and expanded on the rules of the new
system. See "Soundexing and Genealogy" by Gary Mokotoff for a discussion of
the D-M Soundex System's improvements over the conventional English system,
the coding rules, the D-M coding chart, examples, etc.