KNOWLES / KNOLES / NOLES
Family  Association
 

HOME PAGE BACKGROUND MEMBERSHIP GENEALOGY GENETICS REUNIONS
OFFICERS BYLAWS LIBRARY PROGENITORS PHOTOGRAPHS AFFILIATES

 

GENEALOGY
Tips  &  Tricks

SOUNDEX  SYSTEM
(A short guide to the U.S. (English) Soundex Code System)


 


Introduction  to  the  Soundex  Coding  System

During your genealogical research you will occasionally need to depend on the Soundex System.  The most prominent use of the Soundex System has been as the method for indexing names in the 1880, 1900, 1910, 1920 and 1930 U.S. Population Censuses.  Soundex can also aid genealogists in many other situations by identifying the spelling variations for a given surname.

Nothing is more frustrating to a genealogist than to look through an alphabetic index of records for a particular surname and not be able to locate what is believed to be there, only to find out later, perhaps years later, that the data was there, but the entry was misspelled or spelled quite differently.  And how do you locate the towns of immigrant non-English-speaking ancestors when the only information available was passed down orally through the generations and, no matter how you try to spell it, you cannot find the town on a map?

A major solution to these indexing problems was provided many years ago when Robert C. Russell of Pittsburgh, Pennsylvania was issued patent number 1,261,167 on April 2, 1918.  Russell's patent was granted for having "invented certain new and useful Improvements in Indexes ... as will enable others skilled in the art to which it appertains to make and use the same."  With Russell's patent, the idea of indexing information by how it sounds rather than alphabetically was born.  Russell's patented indexing system has become known simply as 'soundexing'.

Soundexing is a hashing system for English words.  The Soundex Code for an English word (or name) consists of a letter and three numbers that roughly describe how the word sounds.  Similar-sounding words will have the same or similar codes.  The use of the code in a filing (and retrieving) system keeps together words of the same and similar sounds that have variant spellings.  The Soundex Code is often used by 411 systems (phone information), and by some state driver's license databases for looking up alternate spellings of a last name.  The Soundex Code has been used by the United States Census Bureau for finding similar names in census records.  Some genealogical database computer programs and some genealogical information Web sites [including ancestry.com and familysearch.org (LDS)] use a Soundex System for people searches.  More digital genealogical tools are likely to use some sort of soundexing in the future.

In Russell's patent, he said, "There are certain sounds which form the nucleus of the English language, and those sounds are inadequately represented merely by the letters of the alphabet, as one sound may sometimes be represented by more than one letter or combination of letters, and one letter or combination of letters may represent two or more sounds.  Because of this, a great many names may have two or more different spellings which in an alphabetic index, or an index which separates names according to the sequence of their contained letters in the alphabet, necessitates their filing in widely separate places."

Russell knew that the letters of the alphabet were divided, phonetically, into distinct categories.  To each category, he assigned a numeric value.  His patent describes the sound categories of the letters in the alphabet as follows:

  • The vowels (Russell called them 'oral resonants'):  a, e, i, o, u, y

  • The labials and labio-dentals:  b, f, p, v

  • The gutturals and sibilants:  c, g, k, q, s, x, z

  • The dental-mutes:  d, t

  • The palatal-fricative:  l

  • The labio-nasal:  m

  • The dento- or lingua-nasal:  n

  • The dental fricative:  r

Through the years, Russell's system has been improved upon.  Those familiar with the current Soundex System used by the U.S. government will see that the original system was changed by combining the letters 'm' and 'n', dropping vowels altogether unless one is the initial letter of the word, and dropping the rule regarding 'gh' and words that end with 's' or 'z'.
 



Definitions  of  Speech  Sounds

 

  • Dentals  -  speech sounds articulated with the tongue tip touching the back of the upper front teeth or immediately above them.

  • Fricative (also called spirant)  -  speech sounds characterized by audible friction produced by forcing the breath through a constricted or partially obstructed passage in the vocal tract.

  • Gutturals  -  speech sounds articulated in the back of the mouth or throat.

  • Labials  -  speech sounds articulated primarily by the lips.

  • Nasals  -  speech sounds pronounced with the voice issuing through the nose, either partly, as in French nasal vowels, or entirely, as in the letters 'm' and 'n' or the 'ng' in song.

  • Palatals  -  speech sounds articulated with the blade of the tongue held close to or touching the hard palate.

  • Resonant  -  a vowel or a voiced consonant or semivowel that is neither a stop nor an affricate (a speech sound released slowly as in either of the 'ch' sounds in church), as in 'l', 'm', 'ng', 'n', 'r', 'w' and 'y'.

  • Sibilants  -  speech sounds characterized by a hissing sound, like those spelled with 's' in the words this, rose, pressure, etc., and similar uses of 'ch', 'sh', 'z', 'zh', etc.

  • Vowels  -  speech sounds produced without occluding, diverting, or obstructing the flow of air from the lungs.

 

 


Using  the  U.S. (English)  Soundex  Code  System

The first letter of the Soundex Code is simply the first letter of the word.  The remaining numbers, ranging from 1 to 6, define different categories of sounds created by the consonants that follow the first letter of the word.  If the word is too short to generate 3 numbers, 0s are added as needed.  If the generated code is longer than 3 numbers, the extra numbers are thrown away.
 

CODE       LETTERS                           DESCRIPTION
1          B,  F,  P,  V                     Labials
2          C,  G,  J,  K,  Q,  S,  X,  Z     Gutturals & Sibilants
3          D,  T                             Dental
4          L                                 Long Liquid
5          M,  N                             Nasal
6          R                                 Short Liquid
SKIPPED    A,  E,  H,  I,  O,  U,  W,  Y     Vowels plus H,  W,  & Y

There are a number of special cases when calculating a Soundex Code:
  • Letters with the same Soundex number that are immediately next to each other are discarded.  So Pfizer becomes Pizer (coded as P-260), Lloyd becomes Loyd (coded as L-300), Sack becomes Sac (coded as S-200), Czar becomes Car (coded as C-600), Kelly becomes Kely (coded as K-400), and Schaefer becomes Shaefer (coded as S-160).

  • The letter combination 'gh', and a final 's' or 'z', may be discarded if they are at the end of the word.  (This rule comes from Russell's original system; the current U.S. government system no longer applies it.)

  • The letters 'h' and 'w' are completely disregarded except as initial letters of the name.

  • If two letters with the same Soundex number are separated by 'h' or 'w', the code uses only the first letter.  So Ashcroft is treated as Ashroft (coded as A-261).

  • Surname prefixes such as van, von, Di, de, le, D', dela, or du are sometimes disregarded in alphabetizing and in coding (try such names both ways).

  • Mc and Mac are NOT considered prefixes.

Note: NARA reminds us that not all Bureau of the Census employees involved with soundexing names strictly followed the rules.  For example, Ashcroft should have been coded as A-261, but may have been coded as A-226 in some cases.
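The coding table and special cases above can be sketched in a few lines of Python.  This is only a minimal illustration, not an official implementation: the function name soundex and the CODES table are names chosen for this example, prefixes such as van or de are not stripped, and any character outside the 26 English letters is simply ignored.

```python
# Letter groups from the coding table; vowels, H, W and Y carry no code.
CODES = {}
for digit, group in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                     ("4", "L"), ("5", "MN"), ("6", "R")):
    for letter in group:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a code such as 'K-542' (illustrative sketch of the U.S. rules)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Start from the first letter's own code so that Pfizer -> P-260.
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are disregarded and do not separate same-coded
            # letters, so Ashcroft -> A-261 rather than A-226.
            continue
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        # A vowel resets 'previous', so a code repeated after a vowel counts.
        previous = code
    # Pad with zeros or truncate to exactly three numbers.
    return first + "-" + "".join(digits)[:3].ljust(3, "0")


for surname in ("Pfizer", "Lloyd", "Sack", "Czar", "Kelly", "Schaefer", "Ashcroft"):
    print(surname, soundex(surname))
```

Run as written, the loop prints the same codes quoted in the special cases: P-260, L-300, S-200, C-600, K-400, S-160 and A-261.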

 

Examples

NAME            LETTERS      CODE
Beha            B -          B - 000
Bicksler        B - clr      B - 246
Johnson         J - nsn      J - 525
Knoles *        K - nls      K - 542
Knowles *       K - nls      K - 542
Marvel          M - rvl      M - 614
Noel            N - l        N - 400
Noles *         N - ls       N - 420
Prettyman **    P - rtm      P - 635
White           W - t        W - 300

 * Note the code for Knowles & Noles is quite different (illustrating one of the problems with soundexing).
**  Prettyman shares the same code with: Parten, Portwine, Pridemore, Prudden & Purdom
(also illustrating one of the problems with soundexing).
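The filing idea behind these examples, keeping names with the same code together, can be sketched as a small index keyed by Soundex Code.  This is a hedged illustration only: soundex and build_index are hypothetical helper names written for this example, and the coding function is a compact restatement of the rules in the table above.

```python
from collections import defaultdict

# Translation table: coded consonants map to 1-6, vowels and Y to 0.
_TABLE = str.maketrans(
    "BFPV" "CGJKQSXZ" "DT" "L" "MN" "R" "AEIOUY",
    "1111" "22222222" "33" "4" "55" "6" "000000",
)


def soundex(name: str) -> str:
    """Compact sketch of the U.S. Soundex rules; returns e.g. 'K-542'."""
    letters = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    if not letters:
        return ""
    # Drop H and W after the first letter, then translate letters to digits.
    rest = letters[1:].replace("H", "").replace("W", "")
    coded = (letters[0] + rest).translate(_TABLE)
    digits = []
    previous = coded[0]
    for digit in coded[1:]:
        if digit != previous and digit != "0":
            digits.append(digit)
        previous = digit
    return letters[0] + "-" + "".join(digits)[:3].ljust(3, "0")


def build_index(surnames):
    """Group surnames under their Soundex Code, like drawers in a card file."""
    index = defaultdict(list)
    for surname in surnames:
        index[soundex(surname)].append(surname)
    return dict(index)


index = build_index(["Knowles", "Knoles", "Noles", "Prettyman", "Parten",
                     "Portwine", "Pridemore", "Prudden", "Purdom"])
for code, names in sorted(index.items()):
    print(code, ", ".join(names))
```

The index files Knowles and Knoles together under K-542 but puts Noles alone under N-420, while six unrelated surnames all land under P-635, which are the two problems the footnotes describe.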

 


 

Soundex  Code  Calculator

If calculating the code by hand is not your thing, a Soundex Code calculator (such as the one provided by RootsWeb) will determine the code for your surnames of interest.  The RootsWeb Soundex search form returns the Soundex Code for the surname entered and identifies other surnames or spellings sharing the same Soundex Code.



U.S.  Population  CENSUS  use  of  Soundex

(per the National Archives and Records Administration)

1880  -  The 1880 Census is indexed only for those families with children aged 10 years or younger.


1890  -  A Department of Commerce fire in 1921 destroyed most of the 1890 Census.  Although there is no Soundex, there is an alphabetical index for the small percentage of population schedules that survived the fire.


1900  -  There is a Soundex Index for all states.


1910  -  There is a Soundex Index for only the following states:

Alabama        Kentucky          Oklahoma
Arkansas       Louisiana         Pennsylvania
California     Michigan          South Carolina
Florida        Mississippi       Tennessee
Georgia        Missouri          Texas
Illinois       North Carolina    Virginia
Kansas         Ohio              West Virginia

1920  -  There is a Soundex Index for all states.


1930  -  There is a Soundex Index for only the following states:

Alabama        Kentucky  (part) *    South Carolina
Arkansas       Louisiana             Tennessee
Florida        Mississippi           Virginia
Georgia        North Carolina        West Virginia  (part) **

* These Kentucky counties are soundexed:
Bell, Floyd, Harlan, Kenton, Muhlenberg, Perry and Pike.

** These West Virginia counties are soundexed: 
Fayette, Harrison, Kanawha, Logan, McDowell, Mercer and Raleigh.


1940 and Later  -  These Censuses are not yet available to the public because of legislation requiring a 72-year delay in their release.  In any event, there are no Soundex Indexes after 1930.  The Soundexing project was a WPA Project.


 

Problems  with  the U.S. (English)  Soundex  Code  System

 

The term “Soundex” is a generic term that covers many variations of an algorithm for categorizing (not searching) names that was first patented in 1918.  Most variations work the same in that they convert the name into a code, or 'key', consisting of the first letter, followed by several numbers that are assigned based upon a pre-determined grouping of consonants.  The name Soundex implies that the algorithm is a highly accurate phonetic matching algorithm, which is not the case at all.  Consider the results of these studies:

  • Only 33% of the matches that would be returned by Soundex would be correct.  Even more significant was the finding that fully 25% of correct matches would fail to be discovered by Soundex.  (Alan Stanier, September 1990, Computers in Genealogy, Vol. 3, No. 7)

  • Only 36.37% of Soundex returns were correct, while more than 60% of correct names were never returned by Soundex.  (A.J. Lait and B. Randell, 1996)
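In search terms, the first figure in each study is precision (the share of returned names that are correct) and the second describes recall (the share of correct names that are actually returned).  The short sketch below works through that arithmetic.  The counts are hypothetical, chosen only so that they reproduce the percentages quoted from the Stanier study; they are not data from either study.

```python
def precision(correct_returned: int, total_returned: int) -> float:
    """Share of the returned names that are genuine matches."""
    return correct_returned / total_returned


def recall(correct_returned: int, missed: int) -> float:
    """Share of all genuine matches that the search actually returned."""
    return correct_returned / (correct_returned + missed)


# Hypothetical counts (assumption for illustration only).
total_returned = 300    # names returned by a Soundex lookup
correct_returned = 99   # of those, the genuine matches
missed = 33             # genuine matches that Soundex never returned

p = precision(correct_returned, total_returned)
r = recall(correct_returned, missed)
print(f"precision: {p:.0%}")          # share of returns that are correct
print(f"recall:    {r:.0%}")          # share of correct names found
print(f"missed:    {1 - r:.0%}")      # share of correct names not found
```

With these counts the researcher reads through 300 names to find 99 useful ones, and a quarter of the genuine matches never appear in the list at all.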

 

A word of warning about Soundex Codes and their use.  In theory the code should always be the same for a given name; in actual practice, however, codes sometimes vary.  There are a number of reasons for this.  Sometimes an implementation of the algorithm has bugs that only become apparent in a small number of cases (there are a number of implementations with bugs).  Sometimes last names are entered into a system or database incorrectly.  In addition, the Soundex System is basically English-oriented.  There is no support for characters beyond the 26 letters used in the English language.  As a result, names with unusual letters (like ć, ř, or Đ) are sometimes encoded in different ways by different people and programs.


Are you considering using Soundex for anything important?  You might want to think again.  Soundex is actually a pretty poor algorithm for doing fuzzy name comparisons.  Newer, albeit more complicated, coding systems may be needed for your application.

 

The ten (10) major problems with the English based Soundex System as implemented in the U.S. (and with other key-based name match solutions) that you might encounter when looking for genealogical records are:


1. Poor Precision producing MANY FALSE POSITIVES!  -  False positives are the nemesis of any name searching application because for each legitimate match candidate, the user must sift through many useless ones.  Soundex and other key-based algorithms are well known for their generation of an intolerably high percentage of false positives.


2. Dependence on Initial Letter  -  Key-based algorithms rely on the initial letter to generate the key.  Someone looking for Korbin may enter Corbin;  someone looking for Kreighton may enter Creiton; and someone looking for Noles may enter Knowles.  Think of all of the matches that will never be found.


3. Poor Handling of Multi-Cultural Names  -  Will your research or application need to process ethnically diverse names?  If so, you will likely be very disappointed with the results from key-based approaches.  Searching ethnically diverse names is a multi-dimensional problem that must be addressed differently depending upon the type of name being handled.  Treating every name as a string of characters from left to right is a flawed approach.


4. Unranked, Unordered Returns  -  Key-based approaches such as Soundex do not have the ability to intelligently rank a set of candidate matches or to measure the degree of similarity between a pair of names.  Some Soundex variations have simplistic algorithms that attempt to rank similarity based upon such things as the number of matched characters, but tests have shown these to be extremely flawed for name searching.


5. Poor Handling of Name Syntax Variation  -  The “first-middle-last” database model used in most databases can cause syntax issues for name searching, especially with ethnically diverse names.  Name components may appear in the wrong fields and, if more than three name components exist, they may be concatenated into a single field.  Key-based approaches do not do an adequate job of handling these issues.


6. Poor Handling of Name Equivalence  -  Many names have related forms that cannot be searched using a compression algorithm or fuzzy match logic.  A simple example is Peggy and Margaret.  Some Soundex variations have been enhanced with some rudimentary name tables but none have the breadth of coverage necessary to do an adequate job of processing ethnically diverse names.


7. Noise Intolerance  -  Keying errors and other noise problems are unavoidable when processing name data.  However, because there is no “correct” spelling of names, in general, they are resistant to data-validation and data-cleansing techniques.  If a database record contains the name “Msith”, key-based approaches such as Soundex are helpless at identifying this as a potential variation for “Smith”.


8. Poor Handling of Name Particles  -  Many cultures have name particles that wreak havoc on Anglo-centric and key-based search algorithms.  Abdul Rajeeb may also be represented as Abdel Raqeeb, Abd Al Ragib, Abderaqib, Abdurra’ib, and many other ways.  Including a name particle such as Abdul can be disastrous with key-based approaches.


9. Poor Handling of Phonetic Variations  -  Despite having the name Soundex, these types of approaches perform poorly on names with true phonetic variations such as Leighton / Layton,  Phifer / Pheiffer / Fifer,  and  Coghburn / Coburn.


10. Poor Handling of Abbreviations and Initials  -  Real world applications, especially those that contain name data transcribed from written documents, are rife with abbreviated names and initials.  Records for James Peter Jones and J.P. Jones may well refer to the same person, but most search algorithms will not discover the connection.  If you intend to process ethnically diverse data, consider that most Anglo-centric data entry and data query professionals would know how to handle the “MD” in John R. Smith, MD.  But would they know how to handle Md. Abdul Rahman?  Md. is a common abbreviation for Mohammed.  This is just one of thousands of examples across many cultures.
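Several of the problems listed above (dependence on the initial letter, noise intolerance, and poor handling of phonetic variations) are easy to see by coding the example names themselves.  The sketch below is an illustration under stated assumptions: soundex is a compact, hypothetical helper written for this example from the U.S. coding rules, not the code used by any particular index or Web site.

```python
# Translation table: coded consonants map to 1-6, vowels and Y to 0.
_TABLE = str.maketrans(
    "BFPV" "CGJKQSXZ" "DT" "L" "MN" "R" "AEIOUY",
    "1111" "22222222" "33" "4" "55" "6" "000000",
)


def soundex(name: str) -> str:
    """Compact sketch of the U.S. Soundex rules; returns e.g. 'S-530'."""
    letters = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    if not letters:
        return ""
    # Drop H and W after the first letter, then translate letters to digits.
    rest = letters[1:].replace("H", "").replace("W", "")
    coded = (letters[0] + rest).translate(_TABLE)
    digits = []
    previous = coded[0]
    for digit in coded[1:]:
        if digit != previous and digit != "0":
            digits.append(digit)
        previous = digit
    return letters[0] + "-" + "".join(digits)[:3].ljust(3, "0")


# Name pairs taken from problems 2, 7 and 9 in the list above.
pairs = [
    ("Korbin", "Corbin"),        # initial letter differs
    ("Kreighton", "Creiton"),    # initial letter differs
    ("Noles", "Knowles"),        # initial letter differs
    ("Msith", "Smith"),          # keying error
    ("Leighton", "Layton"),      # phonetic variation
    ("Phifer", "Fifer"),         # phonetic variation
    ("Coghburn", "Coburn"),      # phonetic variation
]
for left, right in pairs:
    a, b = soundex(left), soundex(right)
    verdict = "match" if a == b else "NO MATCH"
    print(f"{left:10} {a}   {right:8} {b}   {verdict}")
```

Every pair comes out as NO MATCH.  Korbin / Corbin (K-615 / C-615) and Phifer / Fifer (P-160 / F-160) differ only in the initial letter, while Msith / Smith (M-230 / S-530) and Leighton / Layton (L-235 / L-350) differ in the numbers as well.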

 


 

The Daitch-Mokotoff Soundex System


The latest significant improvement to soundexing is the Daitch-Mokotoff soundex system.  In 1985, Gary Mokotoff indexed the names of some 28,000 persons who legally changed their names while living in Palestine from 1921 to 1948, most of whom were Jews with Germanic or Slavic surnames.  Mokotoff discovered there were numerous spelling variants of the same basic surname and that the list needed to be soundexed.  Using the conventional U.S. government system, which is based on the Russell system, many Eastern European Jewish names which sound the same did not soundex the same.  The most prevalent were those names spelled interchangeably with the letter w or v, for example, the names Moskowitz and Moskovitz.

A modification to the U.S. Soundex system was then created and published in the first issue of Avotaynu, the journal of Jewish genealogy, in an article titled "Proposal for a Jewish Soundex Code."  Randy Daitch read the Mokotoff article and expanded on the rules of the new system.  See Soundexing and Genealogy, by Gary Mokotoff, for a discussion of the D-M Soundex System's improvements over the conventional English system, the coding rules, the D-M coding chart, examples, etc.

 



   




Webmaster:  Robert B. Noles



Date of last edit:   Monday, December 12, 2005
 © 2000-2006  R.B. Noles & Knowles/Knoles/Noles Family Association   All Rights Reserved