Traditional NLP techniques and the rise of LLMs

The field of natural language processing (NLP) is rapidly evolving thanks to recent advances in deep learning and computational power. In particular, the emergence of large language models (LLMs) like GPT-3 has transformed many NLP tasks. However, traditional NLP techniques still have an important role to play. In this post, we'll explore some foundational NLP techniques and examine how they compare and contrast with modern LLMs.

Overview of traditional NLP techniques

Before diving into specific techniques, let's briefly go over the goals of traditional NLP. The aim is to develop methods that allow computers to understand, interpret, and generate human language. Some common tasks include:

  • Text classification: Automatically assigning categories or labels to documents based on their content. This supports applications like sentiment analysis, topic labeling, and spam detection.

  • Information extraction: Identifying and extracting structured information from unstructured text data. This can involve named entity recognition, relation extraction, etc.

  • Machine translation: Automatically translating text from one language to another.

  • Question answering: Answering natural language questions by referencing a knowledge base or unstructured data.

  • Summarization: Generating a shortened version of a long text that captures the main points.

To tackle these tasks, NLP researchers developed a toolkit of techniques including:

Tokenization

Tokenization splits text into individual words, phrases, symbols, or other meaningful elements called tokens. It is an essential first step for most NLP workflows.

Stop word removal

Stop words are common words like "the", "and", and "is" that carry little useful information on their own. Removing them can reduce noise and speed up processing for many classical models.

Stemming and lemmatization

Both techniques reduce words to a common root form. Stemming crudely chops off word endings (e.g. "studies" becomes "studi"), while lemmatization uses vocabulary and morphological analysis to return a word's dictionary form, or lemma (e.g. "studies" becomes "study").

Bag-of-words

With this model, text is represented as an unordered collection of words and their frequency counts. This ignores grammar and word order but is useful for tasks like document classification.

TF-IDF

Term frequency-inverse document frequency (TF-IDF) extends bag-of-words by weighting each term by how often it appears in a document and how rare it is across the corpus, so distinctive terms score higher than ubiquitous ones. This can improve results for applications like search and recommendation systems.
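
As a minimal sketch of the idea, scikit-learn's TfidfVectorizer computes these weights directly (the toy documents below are placeholders, and get_feature_names_out assumes a reasonably recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three short "documents"
docs = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'stock markets fell sharply today'
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # shape: (3 documents, vocabulary size)

# Terms shared across documents get low weights; terms concentrated
# in a single document get higher weights.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))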

Part-of-speech tagging

This labels each word with its part of speech (noun, verb, adjective etc.) based on the word’s definition and context. This adds useful linguistic information for many applications.
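
A minimal sketch using NLTK's off-the-shelf tagger (the download calls are only needed the first time, and resource names can vary slightly across NLTK versions):

import nltk
from nltk import pos_tag, word_tokenize

# First-time setup:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog"
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]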

Named entity recognition

This identifies and categorizes important entities (like people, places, and organizations) within unstructured text. It's a key capability for information extraction.
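
A minimal sketch with spaCy, assuming the small English pipeline (en_core_web_sm) has already been downloaded:

import spacy

# First-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

doc = nlp("Apple is opening a new office in Berlin, according to Tim Cook.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple -> ORG, Berlin -> GPE, Tim Cook -> PERSON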

Dependency parsing

This analyzes the grammatical structure of sentences to establish relationships between words and phrases. It can support applications like question answering systems.
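
Continuing the spaCy sketch above, each token's dependency label and syntactic head can be inspected directly:

import spacy

nlp = spacy.load('en_core_web_sm')  # same small English pipeline as above
doc = nlp("The committee approved the new budget")

# Each token is linked to its syntactic head with a dependency label
for token in doc:
    print(f"{token.text:<10} {token.dep_:<10} head: {token.head.text}")
# e.g. "committee" is the nominal subject (nsubj) of "approved"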

With these building blocks, NLP practitioners could develop complex pipelines for tackling industry problems. However, traditional NLP has some inherent limitations. Next, we'll see how modern techniques address them.

The rise of large language models (LLMs)

In recent years, neural network-based approaches have driven rapid progress in NLP. At the forefront are large language models (LLMs) – models trained on massive text datasets that can understand and generate natural language.

LLMs like Google's BERT, OpenAI's GPT-3, and Baidu's ERNIE are transformer-based models. The transformer architecture was first proposed in the 2017 paper "Attention Is All You Need". Transformers have become the backbone of NLP thanks to several key strengths:

  • Context modeling: Self-attention lets transformers relate every token in a sequence to every other token, producing rich context-dependent representations.

  • Parallelization: Unlike recurrent models, transformers process whole sequences in parallel, making it practical to train models with billions of parameters.

  • Transfer learning: Pre-trained LLMs can be fine-tuned on downstream tasks, eliminating the need to train models from scratch (see the sketch below).
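
To make the transfer-learning point concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption on our part, not something the rest of this post requires). The pipeline call downloads a model that was pre-trained on large corpora and then fine-tuned for sentiment classification, so nothing is trained from scratch:

from transformers import pipeline

# Loads a model pre-trained on large text corpora and already
# fine-tuned for sentiment classification
classifier = pipeline('sentiment-analysis')

print(classifier("Traditional NLP pipelines are still surprisingly useful."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]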

Thanks to these capabilities, LLMs have achieved state-of-the-art results across many NLP benchmarks:

  • GLUE benchmark – a collection of 9 sentence-understanding tasks. BERT set a new state of the art shortly after its release, and later transformer models have surpassed the human baseline.

  • SQuAD benchmark – a reading-comprehension dataset of 100,000+ questions posed on 500+ Wikipedia articles. BERT-based models achieved F1 scores above 90, topping the leaderboard.

  • SuperGLUE benchmark – a more difficult suite of eight language-understanding and reasoning tasks. At its release in 2020, GPT-3 showed strong few-shot performance on several SuperGLUE tasks without task-specific fine-tuning.

Additionally, LLMs can handle many tasks with few or no examples thanks to their foundational language understanding. This makes them highly versatile and deployable.
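
For instance, a zero-shot classification pipeline from the same transformers library (again an illustrative assumption, with made-up candidate labels) can assign labels the model was never explicitly trained on:

from transformers import pipeline

# Zero-shot classification: no task-specific training examples required
zero_shot = pipeline('zero-shot-classification')

result = zero_shot(
    "The central bank raised interest rates again this quarter.",
    candidate_labels=['politics', 'tech', 'health', 'finance']
)
print(result['labels'][0])  # most likely label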

However, LLMs don't fully solve all NLP challenges. Next, we'll examine some key differences between traditional techniques and modern LLMs.

How do traditional NLP and LLMs compare?

While LLMs have unlocked new horizons in NLP, traditional techniques still have value today. Let's explore some key differences:

Interpretability

Most traditional techniques are interpretable, with clear mechanical rules or decision boundaries. For example, decision trees explicitly model rules that can be examined.

In contrast, LLMs are complex black-box models. Their billions of parameters encode knowledge implicitly through transformed representations. This makes interpretability an ongoing challenge.

Data dependence

LLMs require massive datasets to train effectively. Traditional NLP can work well with modest-sized corpora. Simple models like Naive Bayes and Logistic Regression can be effective on smaller samples.

This data dependence poses challenges where labeled data is scarce. It also raises the risk of amplifying biases present in the training data.

Engineering effort

LLMs require huge computational resources to train and deploy. Companies like Google and Nvidia have dedicated hardware to support models like BERT and Megatron. Most organizations lack access to such large-scale infrastructure.

In contrast, many traditional techniques work efficiently even on commodity hardware. This reduces engineering complexity for organizations with limited resources.

Hybrid approaches

Today, we're seeing hybrid systems that combine LLMs with traditional techniques:

  • Using LLMs for representation learning, then applying simpler models for interpretation and analysis.

  • Employing techniques like POS tagging, dependency parsing, and NER as preprocessing steps to improve LLM performance.

  • Adding simple classifiers on top of LLM embeddings to reduce data needs (see the sketch after this list).

These hybrid approaches aim to get the best of both worlds – leveraging the power of LLMs while offsetting some of their weaknesses.
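
As a minimal sketch of the last pattern, assuming the sentence-transformers package and a small set of placeholder texts and labels, LLM-derived embeddings can feed a lightweight scikit-learn classifier:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder labeled data: article bodies and their topic labels
texts = ["...article body...", "...another article body..."]
labels = ['politics', 'tech']

# LLM-derived sentence embeddings as features...
encoder = SentenceTransformer('all-MiniLM-L6-v2')
X = encoder.encode(texts)

# ...with a simple, interpretable classifier on top
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)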

Traditional NLP techniques on web-scraped data

To demonstrate traditional NLP techniques, let's walk through an example using text data scraped from the web. We'll use popular Python libraries like NLTK, scikit-learn, and pandas.

Scrape articles from a news website

First, we'll scrape article headlines and bodies from a news site using a web scraping library like BeautifulSoup:

from bs4 import BeautifulSoup
import requests

base_url = 'http://example.com'

def scrape_page(page):
    url = base_url + page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Yield one dictionary per <article> element on the page
    for article in soup.find_all('article'):
        headline = article.find('h2').text
        body = article.find('div', class_='article-content').text
        yield {
            'headline': headline,
            'body': body
        }

pages = ['/politics', '/tech', '/health']
scraped_articles = []

for page in pages:
    scraped_articles.extend(scrape_page(page))

This gives us a list of dictionaries containing article headlines and bodies ready for NLP.

Tokenization and normalization

We'll start by splitting the text into tokens and normalizing them to lowercase:

import string
from nltk.tokenize import word_tokenize  # may require nltk.download('punkt') first

def tokenize(text):
    tokens = word_tokenize(text)
    # Strip surrounding punctuation, lowercase, and drop anything left empty
    tokens = [token.strip(string.punctuation).lower() for token in tokens]
    return [token for token in tokens if token]

scraped_articles = [{
    'headline_tokens': tokenize(article['headline']),
    'body_tokens': tokenize(article['body'])
} for article in scraped_articles]

This gives us a useful tokenized representation of the text data.

Stop word removal

Next, we filter out common stop words using NLTK's stopword list:

from nltk.corpus import stopwords  # may require nltk.download('stopwords') first

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words]

scraped_articles = [{
    'headline_tokens': remove_stopwords(article['headline_tokens']),
    'body_tokens': remove_stopwords(article['body_tokens'])
} for article in scraped_articles]

Stemming and lemmatization

To consolidate different forms of words, we'll apply stemming and lemmatization using NLTK:

from nltk.stem import PorterStemmer, WordNetLemmatizer  # lemmatizer may require nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_tokens(tokens):
    return [stemmer.stem(token) for token in tokens]

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

for article in scraped_articles:
    article['headline_stemmed'] = stem_tokens(article['headline_tokens'])
    article['body_stemmed'] = stem_tokens(article['body_tokens'])

    article['headline_lemmatized'] = lemmatize_tokens(article['headline_tokens'])
    article['body_lemmatized'] = lemmatize_tokens(article['body_tokens'])

Term frequency analysis

We can now extract basic term statistics like frequency across the corpus:

from collections import defaultdict
import pandas as pd

# Count how often each lemmatized term appears across all article bodies
frequency = defaultdict(int)
for article in scraped_articles:
    for token in article['body_lemmatized']:
        frequency[token] += 1

term_freqs = pd.DataFrame(list(frequency.items()), columns=['term', 'frequency'])
term_freqs = term_freqs.sort_values(by='frequency', ascending=False)

This gives us a DataFrame showing the most common terms across all article bodies.

Text classification

Using these tokenized and normalized representations, we can train classifiers for tasks like topic labeling. Here is an example with scikit-learn's Naive Bayes:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Split data into training and test sets; assumes `articles` is a list of
# raw article body strings and `labels` their corresponding topic labels
X_train, X_test, y_train, y_test = train_test_split(articles, labels, test_size=0.2)

# Extract bag-of-words features from the text data
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)

# Train a Naive Bayes classifier model
clf = MultinomialNB()
clf.fit(X_train_vectors, y_train)

# Predict on the test set and calculate accuracy
X_test_vectors = vectorizer.transform(X_test)
predictions = clf.predict(X_test_vectors)
accuracy = np.mean(predictions == y_test)
print('Accuracy:', accuracy)

This workflow demonstrates how traditional techniques can unlock value from scraped data.

Conclusion

In summary, traditional NLP and modern LLMs both have important roles to play:

  • Traditional techniques provide accessible and interpretable foundations for many NLP tasks.

  • LLMs have driven dramatic progress through their ability to learn language concepts.

  • Each approach has strengths that can compensate for the other's weaknesses.

For real-world applications, we're likely to see hybrid systems dominate – combining LLMs for representation learning with simpler traditional models for explainability and engineering benefits.

As data volumes grow and hardware improves, LLMs will continue advancing the horizons of what's possible. But traditional techniques will remain useful tools in every NLP practitioner's toolkit.
