N-grams

You can look at n-grams of words in TED Talks. Click on the Collocation button on the main page to switch to Collocation mode.

Collocation mode on the main page

An n-gram is a sequence of n consecutive words. By comparing the frequencies of different n-grams, you can find out which word sequences are more entrenched in the language and, possibly, which are less so.
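
To make the definition concrete, here is a minimal Python sketch of how n-grams can be extracted and counted from a tokenized sentence. It is only an illustration: the ngrams helper and the sample sentence are made up for this example, and nothing here describes how TCSE itself computes its tables.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence, split on whitespace; a real corpus would need proper tokenization.
tokens = "i can't wait to see what you can't wait to build".split()

# Count how often each 2-gram (bigram) occurs.
bigram_counts = Counter(ngrams(tokens, 2))

# Print every bigram that occurs more than once.
for gram, count in bigram_counts.items():
    if count > 1:
        print(" ".join(gram), count)
```

Sequences that recur, such as can't wait in this toy sentence, are the ones a frequency table brings to the top.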

N-gram tabs

TCSE offers four n-gram sizes:

  • 1-gram: Single word frequencies
  • 2-gram: Two-word sequences (bigrams)
  • 3-gram: Three-word sequences (trigrams)
  • 4-gram: Four-word sequences

Here is sample output for the search term wait:

Results (partial)

Position filter buttons

When n-gram results are displayed, you will see a set of filter buttons above the results table:

  • n-gram ALL: Shows all n-grams containing the search term in any position (default)
  • n-gram #1: Shows only n-grams where the search term appears in position 1
  • n-gram #2: Shows only n-grams where the search term appears in position 2
  • (and so on, up to #n)

For example, when searching for wait in 2-gram mode, clicking #1 shows n-grams where wait comes first (e.g., wait for, wait until), while #2 shows n-grams where wait comes second (e.g., can't wait, please wait).

Position filter buttons and n-gram results
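
The effect of the position buttons can be sketched in a few lines of Python. This is an assumption-laden illustration, not TCSE's implementation: filter_by_position is a hypothetical function, and the bigram list is hand-written using the examples given above.

```python
def filter_by_position(ngram_list, term, position=None):
    """Keep n-grams that contain `term`.

    With position=None this mimics the "ALL" button; with a 1-based
    position it mimics "#1", "#2", and so on (hypothetical helper).
    """
    if position is None:
        return [g for g in ngram_list if term in g]
    return [g for g in ngram_list
            if len(g) >= position and g[position - 1] == term]

# Hand-written 2-grams, for illustration only.
bigrams = [("wait", "for"), ("can't", "wait"), ("wait", "until"),
           ("please", "wait"), ("thank", "you")]

first = filter_by_position(bigrams, "wait", 1)    # wait in position 1
second = filter_by_position(bigrams, "wait", 2)   # wait in position 2
everything = filter_by_position(bigrams, "wait")  # wait in any position

print(first)
print(second)
```

Here first keeps wait for and wait until, second keeps can't wait and please wait, and everything keeps all four.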

Chunk-based n-grams

In the results table, some rows are displayed with a light blue background. These represent noun phrase chunks — multi-word units that function as a single grammatical unit (e.g., immune system, solar system). Rows without the light blue background are simple word-level n-grams.

Chunk-based n-grams highlighted in light blue

This chunk-based analysis helps you identify meaningful multi-word expressions beyond simple word sequences. Click on any row to search for its instances in the transcript corpus.

NER-enhanced n-grams

When you search for a named entity type (e.g., %PERSON, %GPE, %ORG), the n-gram tables display results where actual named entities are replaced with their type labels. For example, searching %PERSON in 2-gram mode might show entries like %PERSON said or thank %PERSON, revealing how different entity types interact with surrounding linguistic structures regardless of the specific entity name.
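
The idea of replacing entities with their type labels before counting can be sketched as follows. The input format (token and label pairs), the mask_entities function, and the sample data are all assumptions made for this example; the source does not describe how TCSE tags or stores entities.

```python
from collections import Counter
from itertools import groupby

def mask_entities(tagged):
    """Replace each run of tokens sharing an entity label with '%LABEL'.

    `tagged` is a list of (token, label) pairs; label is None for
    ordinary words. Note: two adjacent entities with the same label
    would be merged into one by this simple sketch.
    """
    out = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if label is None:
            out.extend(words)         # keep ordinary words as they are
        else:
            out.append("%" + label)   # collapse the entity into its type
    return out

# Made-up tagged tokens, for illustration only.
tagged = [("thank", None), ("Al", "PERSON"), ("Gore", "PERSON"),
          ("and", None), ("thank", None), ("Jane", "PERSON")]

masked = mask_entities(tagged)
bigram_counts = Counter(zip(masked, masked[1:]))
print(masked)
print(bigram_counts[("thank", "%PERSON")])
```

Because both names collapse to %PERSON, the two thank phrases are counted as the same 2-gram regardless of who is being thanked.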

Collocation analysis

The Colloc 2 and Colloc 3 tabs provide collocation analysis for your search term, with results grouped by lemma for accurate association measures. See Collocation analysis for details.

Collocation network

The Network tab provides an interactive force-directed graph visualization of collocational relationships. See Collocation Network for details.

Tips

  • Click on any n-gram in the results to search for its instances in the transcript corpus
  • N-gram frequencies reflect actual usage patterns in TED Talks
  • The Colloc and Network tabs use lemma-based aggregation to combine inflected forms