Frequently Asked Questions about the USGS Thesaurus

What's a thesaurus?
A thesaurus is a type of controlled vocabulary, defined here as
  • a consistent collection of terms
  • chosen for specific purposes
  • with explicitly stated, logical constraints
  • on their intended meanings and relationships.

Terms represent concepts, but it is the concepts themselves and their relationships, not the terms, that constitute the thesaurus.

Terms are related to one another in three different ways:

Hierarchy
A term always has an "is a" relationship with its broader term (BT); a narrower term (NT) can always be said to be "a type of", "a part of", or "an instance of" the parent term.
Preference
For a given concept, one term is chosen as the preferred term or label, and is referred to as the descriptor. Other terms that refer to the same concept are referred to as lead-in terms or non-preferred terms. A non-preferred term is not necessarily a synonym of the preferred term.

We also support a USE-WITH relationship in which a non-preferred text is associated with two descriptors. An example is "soil pollution" USE "contamination and pollution" WITH "soil resources".

Generic relationships
Where concepts are related in some way that cannot be expressed as an "is a" sentence, the thesaurus simply connects one term to another without specifying the nature of the relationship. This is different from more elaborate knowledge-management systems such as topic maps or ontologies in which such generic relationships are always identified and categorized.
What are the components of a thesaurus term?

Each concept represented by a term in the thesaurus has several components that identify, explain, and relate the term to other terms:

Preferred term
This is the short label we use to identify the concept. In the USGS Thesaurus, it must be unique.
Scope note
A succinct explanation of the concept, written in plain language, that helps people understand what we mean by the term, and what we intend to include or exclude in our usage of the term. This is not a formal definition. Its purpose in our web interfaces is to help a user decide whether or not this is the term they want.
Non-preferred terms
Also called "lead-in" or "use-for" terms, these additional short texts mean the same thing for the purposes of the thesaurus. Some of these are synonyms, alternate phrasings, or even common misspellings. Others might represent slightly different concepts that, for the purposes of the thesaurus, can be included under the same preferred label.
Broader term
The thesaurus term immediately "above" this one in the hierarchy. Each term should have an "is a" relationship with its broader term--you should be able to say the term is a type of, part of, or instance of its broader term. In the USGS Thesaurus, each term may have only one broader term.
Narrower term
Terms for which this term is the broader term. Narrower terms represent more specific concepts that are types, parts, or instances of the current term. Each narrower term has all of these components as well.
Related term
Additional, "see also" thesaurus terms that are related to this one in some way but are not broader or narrower terms.
Annotated screen capture of the Keyword Finder interface, with annotations indicating the placement in the interface of the term components.
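
As a rough illustration, the components listed above can be modeled as a small data structure. This is a minimal sketch, not the representation the USGS actually uses: the class name `Term`, its field names, and the example terms and scope note are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Term:
    """One thesaurus concept with the components described above (hypothetical model)."""
    preferred: str                                            # preferred term, unique label
    scope_note: str = ""                                      # plain-language explanation
    non_preferred: List[str] = field(default_factory=list)    # lead-in / use-for texts
    broader: Optional["Term"] = None                          # at most one broader term
    narrower: List["Term"] = field(default_factory=list)      # more specific concepts
    related: List["Term"] = field(default_factory=list)       # "see also" terms

    def add_narrower(self, child: "Term") -> None:
        """Attach child below this term, enforcing the single-broader-term rule."""
        if child.broader is not None:
            raise ValueError(f"{child.preferred!r} already has a broader term")
        child.broader = self
        self.narrower.append(child)


# Hypothetical example terms, chosen only for illustration.
maps = Term("maps and atlases", scope_note="Hypothetical scope note.")
geologic_maps = Term("geologic maps", non_preferred=["geologic mapping"])
maps.add_narrower(geologic_maps)
```

Attaching `geologic_maps` under a second broader term would raise `ValueError`, which mirrors the rule that each term may have only one broader term.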
Is there only one thesaurus?
No. It is helpful to use different controlled vocabularies for the purposes they serve best. Science Topics used the USGS Thesaurus, the Alexandria Digital Library Feature Type thesaurus, and Common Geographic Areas. These, and other thesauri, are available through web services; interactive presentations are at https://apps.usgs.gov/thesaurus/tab-term.html and https://apps.usgs.gov/thesaurus/thesaurus-full.php.
What is the purpose of the USGS Thesaurus?
The USGS Thesaurus is specifically intended to help people outside USGS find information on USGS web sites without specific knowledge of the organizational structure and operations of the USGS. For those inside USGS, the thesaurus provides a source of consistent index terms that spans the full range of USGS activities. Such terms can be used to refine or clarify labels and to support internet search, and the relationships among them suggest linkages across programs.
How do you determine what terms to add to the thesaurus?
We use the following criteria to decide whether to include or exclude a proposed term:
  • Concept should be distinct from other existing terms, particularly in how USGS uses it.
  • Concept needs to be in scope for USGS Thesaurus:
    • Focus on categorizing scientific information.
    • Things we study, and the ways we study them.
    • Avoid words that are too general, such as "research" or "natural".
    • Adjectives or adverbs are often not helpful, especially as isolated words, like "recurring".
    • Organizational terms should be used only if they refer to scientific work. Laudable goals such as "diverse workforce" would be appropriate for a different vocabulary.
  • Scope of term needs to be limited so that it can usefully distinguish collections of information; if everything could be assigned the term, it's not a good discriminator.
  • USGS has published scientific information about the concept.
  • Parenthetical qualifiers, if needed, should be one word, or two at most.
  • We should be able to identify two or three reports or datasets or other things that would be assigned the term.
  • Must not be more specific than necessary. If the broader term covers the concept well enough, a USE-FOR text might suffice.
  • Must be encapsulated in a small number of words, so that it doesn't require a paragraph to identify. The scope note can be used to explain further, but it should not be overly long.
  • Effective terms are easy to apply as well as easy to recognize. It does not help to make fine logical distinctions among terms, because terms that differ in fine details are often assigned inconsistently. For example, it is not necessary to have separate terms for geologic maps and geologic mapping, because an information resource assigned geologic maps would naturally have been produced by geologic mapping. The distinction between these concepts is unimportant for the task of finding pertinent information resources.
  • Scope notes are relatively short texts written in relatively plain language. Their purpose is only to help people decide whether the term is the one that matches their expectations. Scope notes will never be shown to users apart from the preferred term label, so the scope note does not need to include it. The same considerations that caused Twitter to limit tweets to 140, then 280 characters apply to scope notes here.
Free text search of a large collection is deceptively effective. When we're looking for an answer to some problem, we don't need all of the relevant documents, we only need enough documents to know that there aren't several very different answers to our problem; we hope the answers come from different people and say essentially the same thing. Since we find documents that use the words we typed, we quickly come to believe that other people write about our problem using the same terms that we do. But this agreement is deceptive--we've simply ignored documents that don't use the text we chose.
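
The effect described above can be sketched with a toy collection. Everything here is hypothetical: the documents, the preferred term "floods", and its lead-in phrases are invented for illustration and are not taken from the USGS Thesaurus.

```python
# Hypothetical documents; three describe the same concept in different words.
DOCUMENTS = {
    "doc1": "Flood frequency analysis for small streams",
    "doc2": "Inundation mapping after the storm",
    "doc3": "High-water marks surveyed along the river",
    "doc4": "Groundwater levels in the coastal plain",
}

# Hypothetical phrases that all lead to one preferred term.
LEAD_INS = {
    "floods": ["flood", "inundation", "high-water"],
}


def free_text_search(query):
    """Return ids of documents whose text contains the query string."""
    q = query.lower()
    return sorted(d for d, text in DOCUMENTS.items() if q in text.lower())


def vocabulary_search(preferred):
    """Return ids of documents matching any phrase tied to the preferred term."""
    phrases = LEAD_INS.get(preferred, [preferred])
    return sorted(
        d for d, text in DOCUMENTS.items()
        if any(p in text.lower() for p in phrases)
    )


print(free_text_search("flood"))     # ['doc1']
print(vocabulary_search("floods"))   # ['doc1', 'doc2', 'doc3']
```

The free text search returns a plausible result and gives no hint that two other pertinent documents were skipped because they use different words.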
How is the thesaurus stored? Can I get a copy of it?

The USGS Thesaurus and the other controlled vocabularies we use are stored in a relational database. Each thesaurus has three tables: one for preferred terms, including the hierarchical relationships; one for non-preferred terms; and one providing the see-also linkages from one preferred term to another. A single table contains information about all of the thesauri in the database, and associative tables link that table with alternative names for thesauri and category terms for thesauri. These tables are illustrated in the following simple entity-relationship diagram:

Thesaurus entity-relationship diagram.

Data dictionary

Table term: Table of preferred terms, with hierarchical relationships. Field parent indicates the broader term.
  • code: Unique identifier of a term, may be integer or character.
  • name: Text of the term, character.
  • parent: Unique identifier of the parent term, a value of code or NULL.
  • scope: Scope note for the term, often serves as a general definition. May be NULL.

Table nonpref: Table of non-preferred terms. Field also specifies coordinated terms in a USE-WITH relationship.
  • code: Unique identifier of a preferred term, a value of term:code.
  • name: Non-preferred text, character. Cannot match a value of term:name.
  • also: Unique identifier of the coordinated term if the non-preferred text describes a USE-WITH relationship, a value of term:code, or NULL if the non-preferred text describes a USE-FOR relationship.
  • visible: Integer flag indicating whether the non-preferred term should be shown to end-users. If the non-preferred term is informative, this should be 1. If the non-preferred term is a misspelling, this should be 0.

Table relterm: Table of non-hierarchical "see-also" relationships between terms.
  • a: Unique identifier of a term, a value of term:code.
  • b: Unique identifier of the related term, a different value of term:code.

Table thesaurus: Information about each thesaurus present in the database.
  • tag: Unique integer identifying the thesaurus. In web services this value is named thcode.
  • name: Preferred name of the thesaurus. Alternative names are specified in the associative table thname.
  • creator: Person or organization responsible for the creation of the thesaurus.
  • rights: Legal statement indicating limitations on usage of the thesaurus, if any.
  • edition: Version number or other edition indicator.
  • date: Revision date, in form YYYY-MM-DD.
  • tblname: Name of the table containing preferred terms, for example term.
  • codetype: Term indicating whether the unique identifiers of the terms are integer (value number) or alphanumeric (value alpha).
  • contact: Name or email address of the primary contact for the thesaurus.
  • nonpref: Name of the table containing non-preferred terms, for example nonpref.
  • relterm: Name of the table containing non-hierarchical "see-also" relationships between terms, for example relterm.
  • prefix: Suggested short prefix to use in XML namespace declarations.
  • uri: Uniform Resource Identifier for the thesaurus, for use in XML namespace declarations.
  • mdate: Last modified date of this record, in format YYYY-MM-DD.
  • mtime: Last modified time of this record, in format HH:MM:SS.
  • userid: User name or email address of the person last modifying this record.
  • scope: General scope statement indicating the intended usage and purpose of the thesaurus.

Table thname: Table of alternative names for thesauri.
  • thcode: Unique identifier of a thesaurus, a value of thesaurus:tag.
  • name: Alternative name of the thesaurus. Must not match any value of thesaurus:name.

Table thcategory: Table of category terms assigned to thesauri.
  • thcode: Unique identifier of a thesaurus to which the term is assigned, a value of thesaurus:tag.
  • thesaurus: Unique identifier of a thesaurus from which the assigned term is drawn, a value of thesaurus:tag.
  • code: Unique identifier of a term from the thesaurus identified in thesaurus.
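
The three per-thesaurus tables can be sketched in SQLite as follows. This is an illustration built from the data dictionary, not the actual schema: the column types, constraints, integer codes, and the top term "topics" are assumptions, and the helper functions `resolve` and `narrower` are hypothetical. Only the "soil pollution" USE-WITH example comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    code   INTEGER PRIMARY KEY,            -- assumed integer; may be character
    name   TEXT NOT NULL UNIQUE,
    parent INTEGER REFERENCES term(code),  -- NULL for a top term
    scope  TEXT
);
CREATE TABLE nonpref (
    code    INTEGER NOT NULL REFERENCES term(code),
    name    TEXT NOT NULL,
    also    INTEGER REFERENCES term(code), -- NULL means USE-FOR
    visible INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE relterm (
    a INTEGER NOT NULL REFERENCES term(code),
    b INTEGER NOT NULL REFERENCES term(code),
    CHECK (a <> b)
);
""")

# Hypothetical codes and a hypothetical top term named "topics".
conn.executemany(
    "INSERT INTO term (code, name, parent, scope) VALUES (?, ?, ?, ?)",
    [
        (1, "topics", None, None),
        (2, "contamination and pollution", 1, None),
        (3, "soil resources", 1, None),
    ],
)
# "soil pollution" USE "contamination and pollution" WITH "soil resources"
conn.execute(
    "INSERT INTO nonpref (code, name, also, visible) VALUES (2, 'soil pollution', 3, 1)"
)


def resolve(text):
    """Return the preferred term names that a non-preferred text leads to."""
    row = conn.execute(
        "SELECT t.name, w.name FROM nonpref n "
        "JOIN term t ON t.code = n.code "
        "LEFT JOIN term w ON w.code = n.also "
        "WHERE n.name = ?",
        (text,),
    ).fetchone()
    if row is None:
        return []
    return [name for name in row if name is not None]


def narrower(name):
    """Return the names of terms whose parent is the named term."""
    rows = conn.execute(
        "SELECT c.name FROM term c JOIN term p ON c.parent = p.code "
        "WHERE p.name = ? ORDER BY c.name",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]


print(resolve("soil pollution"))  # ['contamination and pollution', 'soil resources']
print(narrower("topics"))         # ['contamination and pollution', 'soil resources']
```

Because a USE-WITH entry stores its second descriptor in the also column, a single left join is enough to handle both USE-FOR and USE-WITH lookups.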

Download | Size | Content | Format
thesauri.zip | 8 MB | All thesauri | SQLite
USGSThesaurus.rdf | 960 kB | USGS Thesaurus | SKOS RDF-XML
MarinePlanningData.rdf | 100 kB | Data Categories for Marine Planning | SKOS RDF-XML
CMECS.rdf | 575 kB | Coastal and Marine Ecosystem Classification System | SKOS RDF-XML
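
A SKOS RDF-XML file can be read with a general XML parser. The sketch below uses an inline sample rather than one of the downloads, because the exact layout of those files is not described here. The sample document, its example.org URIs, the alternate label, and the function name `read_concepts` are all assumptions; only the RDF and SKOS namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical sample; the real files may be structured differently.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/thesaurus/2">
    <skos:prefLabel>contamination and pollution</skos:prefLabel>
    <skos:altLabel>hypothetical lead-in text</skos:altLabel>
    <skos:broader rdf:resource="http://example.org/thesaurus/1"/>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(root):
    """Map each skos:Concept URI to its labels and broader-concept URI."""
    concepts = {}
    for node in root.iter(f"{{{SKOS}}}Concept"):
        uri = node.get(f"{{{RDF}}}about")
        broader = node.find(f"{{{SKOS}}}broader")
        concepts[uri] = {
            "prefLabel": node.findtext(f"{{{SKOS}}}prefLabel"),
            "altLabels": [e.text for e in node.findall(f"{{{SKOS}}}altLabel")],
            "broader": broader.get(f"{{{RDF}}}resource") if broader is not None else None,
        }
    return concepts


concepts = read_concepts(ET.fromstring(SAMPLE))
```

For a downloaded file, `read_concepts(ET.parse("USGSThesaurus.rdf").getroot())` would be the equivalent call, assuming the file expresses concepts as `skos:Concept` elements; if it uses `rdf:Description` with an `rdf:type` instead, the element lookup would need to change.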
How was the thesaurus developed? What other vocabularies did you consult?

Philosophy

Search alone is not sufficient to help people find information. Applications intended to help people find information must also help people understand the scientific, technical, and business context in which it is meaningful. People do not in any usable sense find information without knowing what it is they have found and how it relates to other information.

Design goals

  1. The USGS Thesaurus is designed to conform with a recognized standard, ANSI/NISO Z39.19. This standard has been in widespread use throughout the information science community for many years.
  2. The thesaurus is broad and shallow. It is not intended to enumerate or distinguish the fine details of USGS science, and it is not intended to duplicate detailed search within a scientific database on a particular topic that would ordinarily be provided by a web site developer.
  3. The thesaurus is explicitly intended for use in a web browsing environment. Consequently it is strictly hierarchical. No term has more than one broader term; alternative broader terms are shown as related terms instead. Also, the number of top terms is intentionally kept small to enable browse interfaces to function well.
  4. The thesaurus is monolingual. Foreign-language equivalents are possible in principle but have not been incorporated into the current design.
  5. The thesaurus is intended to cover only those facets of information for which other controlled vocabularies were either not available or were not optimal for categorizing USGS information. Consequently the thesaurus does not include place names, types of named geographic features, detailed biological taxonomy, chemical and mineral names, USGS publication series names, or names of organizational units and programs.
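
One practical consequence of the strict hierarchy in design goal 3 is that every term has exactly one path up to its top term, which is what a browse interface needs for a breadcrumb trail. A minimal sketch, assuming the hierarchy is held as a mapping from each term to its single broader term; the function name and the example terms are hypothetical.

```python
def path_to_top(term, parent_of):
    """Follow broader-term links from term up to its top term.

    parent_of maps each term to its one broader term, or None for a top term.
    """
    path = [term]
    seen = {term}
    while parent_of.get(path[-1]) is not None:
        nxt = parent_of[path[-1]]
        if nxt in seen:
            # A loop would mean the data is not a valid hierarchy.
            raise ValueError(f"cycle detected at {nxt!r}")
        seen.add(nxt)
        path.append(nxt)
    return path


# Hypothetical hierarchy, for illustration only.
PARENT_OF = {
    "sciences": None,
    "earth sciences": "sciences",
    "geology": "earth sciences",
}

print(path_to_top("geology", PARENT_OF))
# ['geology', 'earth sciences', 'sciences']
```

If terms could have several broader terms, this lookup would return a set of paths rather than one, and the interface would have to choose which to display.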

Development methods

Specialists recognize two different strategies for building controlled vocabularies: top-down, in which terms and their relationships are defined intuitively prior to their direct application in an indexing situation; and bottom-up, in which terms and relationships are added to the vocabulary in the process of indexing. The same specialists also recognize that most vocabularies are developed using a combination of these two abstract approaches. We developed the USGS Thesaurus using this combined strategy. Beginning by simply listing many important terms, we grouped those terms using a card-sorting procedure, and then refined the hierarchy with intuitive processes (that is, by relying on what we know). Subsequent revisions have occurred by group deliberation.

Preliminary development of the thesaurus was conducted by a contractor using commercial software (MultiTES). Subsequent development and revision have occurred in a web-based database application that the group developed to meet the specific needs of this project.

Review of existing controlled vocabularies

We examined many similar controlled vocabularies of various types before and during this process. Examples are the GEOREF thesaurus produced by the American Geological Institute, the CERES thesaurus, the Geographic Names Information System (GNIS), the Integrated Taxonomic Information System (ITIS), the categorization scheme used in the Marine Realms Information Bank, and numerous smaller or more specialized vocabularies such as glossaries of scientific and technical terms presented on USGS web sites.

Who has worked on the USGS Thesaurus?
The USGS Thesaurus Working Group is composed of specialists in library and information sciences, communications, the natural sciences, scientific software development, and data management. Its purpose is to create and maintain controlled vocabularies, use those vocabularies to create catalogs and indexes, and develop methodology that will help people find and understand online USGS information resources. The group is associated with the USGS home page design team and coordinates its work with other project tasks as appropriate.
Name | Organization | Expertise
Susan Cochran | Pacific Coastal and Marine Science Center | Ocean sciences, data management
VeeAnn Cross | Woods Hole Coastal and Marine Science Center | Ocean sciences, data management
Arnell Forde | St. Petersburg Coastal and Marine Science Center | Ocean sciences, data management
David Govoni | Office of Enterprise Information (Retired) | Live birds, dead bugs
Leslie Hsu | Community for Data Integration | Earth science, data integration
Cassandra Ladino | Office of Enterprise Information | Administrative information systems, data management
Amanda Liford | Science Analysis and Synthesis | Library Science, data management
Fran Lightsom | Woods Hole Coastal and Marine Science Center | Ocean science, data management
Peter Schweitzer | Eastern Mineral and Environmental Resources | Earth science, software development, data management
Lisa Zolly | Science Analysis and Synthesis | Library Science, software development

Former personnel

The following people have worked with the group at various times in the past. Their influence is substantial.
Name | Organization | Expertise

USGS employees
Alan Allwardt | Geology-Pacific Coastal and Marine Science Center | Earth Science, Library Science
Karen Arcamonte | Biology | Library science
Hylan Beydler | Geography-MCMC | Land characterization
Nancy Blair | GIO-Library | Library coordination, cataloging & indexing
Linda Broussard | Biology-Library | Life sciences, records management
Pamela Callais | GIO-Library | Cataloging & indexing
Brian Carpenter | GIO-Library | Library Science
Liz Ciganovich | Water-CAPP | Publications
Wendy Danchuk | Hydrology | Cartography, publications
Jeff Dietterle | GIO-EWeb | Geography, publication
Carmelo Ferrigno | GIO-EWeb | Information architecture & design
Karen Kaye | Biology | Information architecture
Richard Huffine | GIO-SIEO | Library Science
Irena Kavalek | GIO-Library | Cataloging & indexing
Celso Puente | Water | Hydrology
Gary Waggoner | Biology-CBI | Life sciences
Gail Wendt | Communications | Hydrology, communication, publications

Consultants and outside reviewers
Linda Hill | Alexandria Digital Library, UC Santa Barbara
Gail Hodge | Information International Associates, Inc.
Candy Schwartz | Graduate School of Library and Information Sciences, Simmons College
Jessica Milstead | The JELEM Company
Amy Warner | Lexonomy Information Architecture Consulting