The field of natural language processing (NLP) is rapidly evolving thanks to recent advances in deep learning and computational power. In particular, the emergence of large language models (LLMs) like GPT-3 has transformed many NLP tasks. However, traditional NLP techniques still have an important role to play. In this post, we'll explore some foundational NLP techniques and examine how they compare and contrast with modern LLMs.
Overview of traditional NLP techniques
Before diving into specific techniques, let's briefly go over the goals of traditional NLP. The aim is to develop methods that allow computers to understand, interpret, and generate human language. Some common tasks include:
- Text classification: Automatically assigning categories or labels to documents based on their content. This supports applications like sentiment analysis, topic labeling, and spam detection.
- Information extraction: Identifying and extracting structured information from unstructured text data. This can involve named entity recognition, relation extraction, and more.
- Machine translation: Automatically translating text from one language to another.
- Question answering: Answering natural language questions by referencing a knowledge base or unstructured data.
- Summarization: Generating a shortened version of a long text that captures the main points.
To tackle these tasks, NLP researchers developed a toolkit of techniques including:
Tokenization
This involves splitting text into individual words, phrases, symbols, or other meaningful elements called tokens. This is an essential first step for most NLP workflows.
Stop word removal
Stop words are common words like "the", "and", "is" that don't contain much useful information. Removing them can improve processing speed and accuracy for many models.
Stemming and lemmatization
Both techniques aim to reduce words to their root form. Stemming crudely chops off word endings, while lemmatization uses vocabulary and morphological analysis to map each word to its dictionary form (lemma).
Bag-of-words
With this model, text is represented as an unordered collection of words and their frequency counts. This ignores grammar and word order but is useful for tasks like document classification.
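As a quick illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer on two made-up sentences (get_feature_names_out assumes scikit-learn 1.0 or newer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; a real corpus would be much larger
docs = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(counts.toarray())                    # raw word counts per document
```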
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is an extension of bag-of-words that applies different weights to words based on how common they are across all documents. This can improve results for tasks like search engines and recommendation systems.
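Continuing the toy example above, scikit-learn's TfidfVectorizer shows how terms shared by every document are down-weighted relative to rarer ones; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # rows are documents, columns are terms

# Terms appearing in both documents ("the", "cat") receive lower weights than
# terms unique to one document ("mat", "dog")
for term, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(term, round(weight, 3))
```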
Part-of-speech tagging
This labels each word with its part of speech (noun, verb, adjective, etc.) based on the word's definition and context. This adds useful linguistic information for many applications.
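For example, NLTK's off-the-shelf tagger assigns Penn Treebank tags to tokens; a minimal sketch (the tokenizer and tagger models need a one-time download):

```python
import nltk

# One-time downloads for the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # list of (token, tag) pairs, e.g. ('fox', 'NN'), ('jumps', 'VBZ')
```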
Named entity recognition
This identifies and categorizes important entities (like people, places, organizations) within unstructured text. It's a key capability for information extraction.
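spaCy ships with pretrained NER models that make this straightforward; a minimal sketch, assuming the small English model (en_core_web_sm) has been installed:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is opening a new office in Berlin, according to Tim Cook.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Berlin GPE, Tim Cook PERSON
```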
Dependency parsing
This analyzes the grammatical structure of sentences to establish relationships between words and phrases. It supports downstream tasks like question answering.
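spaCy's parser exposes these relations directly; a minimal sketch using the same small English model as above:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee approved the new budget.")

for token in doc:
    # Each token is linked to its syntactic head by a labeled dependency relation
    print(token.text, token.dep_, token.head.text)
```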
With these building blocks, NLP practitioners could develop complex pipelines for tackling industry problems. However, traditional NLP has some inherent limitations. Next, we'll see how modern techniques address them.
The rise of large language models (LLMs)
In recent years, neural network-based approaches have driven rapid progress in NLP. At the forefront are large language models (LLMs) – models trained on massive text datasets that can understand and generate natural language.
LLMs like Google's BERT, OpenAI's GPT-3, and Baidu's ERNIE are transformer-based models. This architecture was first proposed in the 2017 paper “Attention Is All You Need”. Transformers have become the backbone of NLP thanks to key strengths:
- Context modeling: Transformers attend over entire sequences of text at once, producing rich, context-dependent representations of each token.
- Parallelization: Unlike recurrent architectures, transformers process tokens in parallel, allowing efficient training of models with billions of parameters.
- Transfer learning: Pre-trained LLMs can be fine-tuned on downstream tasks, eliminating the need to train models from scratch.
Thanks to these capabilities, LLMs have achieved state-of-the-art results across many NLP benchmarks:
- GLUE benchmark – a collection of nine sentence-understanding tasks. BERT set a new state of the art on GLUE shortly after its release, and later transformer models went on to exceed the human baseline.
- SQuAD benchmark – the Stanford Question Answering Dataset of 100,000+ question-answer pairs drawn from Wikipedia articles. BERT topped the leaderboard with an F1 score above 90.
- SuperGLUE benchmark – a more difficult suite of eight reasoning-oriented tasks. On its release in 2020, GPT-3 posted strong few-shot results on SuperGLUE without any task-specific fine-tuning, though still behind fine-tuned models.
Additionally, LLMs can handle many tasks with few or no labeled examples (few-shot and zero-shot learning), thanks to the broad language understanding they acquire during pre-training. This makes them highly versatile and quick to deploy for new use cases.
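To illustrate, the Hugging Face transformers library wraps this capability in a zero-shot classification pipeline; a minimal sketch (the pipeline downloads a default pretrained model on first use, and the example sentence and candidate labels are made up):

```python
from transformers import pipeline

# Downloads a pretrained NLI-based model on first use; no task-specific training data needed
classifier = pipeline("zero-shot-classification")

result = classifier(
    "The new GPU delivers twice the training throughput of its predecessor.",
    candidate_labels=["technology", "politics", "sports"],
)
print(result["labels"][0], result["scores"][0])  # most likely label and its score
```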
However, LLMs don't fully solve all NLP challenges. Next, we'll examine some key differences between traditional techniques and modern LLMs.
How do traditional NLP and LLMs compare?
While LLMs have unlocked new horizons in NLP, traditional techniques still have value today. Let's explore some key differences:
Interpretability
Most traditional techniques are interpretable, with clear mechanical rules or decision boundaries. For example, decision trees explicitly model rules that can be examined.
In contrast, LLMs are complex black-box models. Their billions of parameters encode knowledge implicitly in learned, distributed representations. This makes interpretability an ongoing research challenge.
Data dependence
LLMs require massive datasets to train effectively. Traditional NLP can work well with modest-sized corpora. Simple models like Naive Bayes and Logistic Regression can be effective on smaller samples.
This data dependence poses challenges where labeled data is scarce. It also causes issues with bias amplification from the training data.
Engineering effort
LLMs require huge computational resources to train and deploy. Companies like Google and Nvidia have dedicated hardware to support models like BERT and Megatron. Most organizations lack access to such large-scale infrastructure.
In contrast, many traditional techniques work efficiently even on commodity hardware. This reduces engineering complexity for organizations with limited resources.
Hybrid approaches
Today, we're seeing hybrid systems that combine LLMs with traditional techniques:
- Using LLMs for representation learning, then applying simpler models for interpretation and analysis.
- Employing techniques like POS tagging, dependency parsing, and NER as preprocessing steps to improve LLM performance.
- Adding simple classifiers on top of LLM embeddings to reduce labeled-data requirements.
These hybrid approaches aim to get the best of both worlds – leveraging the power of LLMs while offsetting some of their weaknesses.
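As a concrete sketch of the first pattern, we can embed documents with a pretrained sentence encoder and fit a small, interpretable classifier on top. This uses the sentence-transformers library; the model name, example texts, and labels are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Pretrained sentence encoder; "all-MiniLM-L6-v2" is a commonly used lightweight model
encoder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The central bank raised interest rates again",
    "The striker scored twice in the final",
    "Parliament passed the new budget bill",
    "The team clinched the championship title",
]
labels = ["finance", "sports", "finance", "sports"]  # toy labels for illustration

# LLM-derived embeddings as features, simple linear model on top
X = encoder.encode(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict(encoder.encode(["Stocks rallied after the earnings report"])))
```

The linear model's coefficients remain easy to inspect, while the embeddings carry the contextual understanding learned by the transformer.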
Traditional NLP techniques on web-scraped data
To demonstrate traditional NLP techniques, let's walk through an example using text data scraped from the web. We'll use popular Python libraries like NLTK, spaCy, scikit-learn, and pandas.
Scrape articles from a news website
First, we'll scrape article headlines and bodies from a news site using a web scraping library like BeautifulSoup:
```python
from bs4 import BeautifulSoup
import requests

base_url = 'http://example.com'

def scrape_page(page):
    url = base_url + page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for article in soup.find_all('article'):
        headline = article.find('h2').text
        body = article.find('div', class_='article-content').text
        yield {
            'headline': headline,
            'body': body
        }

pages = ['/politics', '/tech', '/health']

scraped_articles = []
for page in pages:
    scraped_articles.extend(scrape_page(page))
```
This gives us a list of dictionaries containing article headlines and bodies ready for NLP.
Tokenization and normalization
We'll start by splitting the text into tokens and normalizing them to lowercase:
```python
import string
from nltk.tokenize import word_tokenize

# Requires a one-time download: nltk.download('punkt')
def tokenize(text):
    tokens = word_tokenize(text)
    tokens = [token.strip(string.punctuation) for token in tokens]
    # Lowercase and drop tokens that were pure punctuation
    tokens = [token.lower() for token in tokens if token]
    return tokens

scraped_articles = [{
    'headline_tokens': tokenize(article['headline']),
    'body_tokens': tokenize(article['body'])
} for article in scraped_articles]
```
This gives us a useful tokenized representation of the text data.
Stop word removal
Next, we filter out common stop words using NLTK's stopword list:
```python
from nltk.corpus import stopwords

# Requires a one-time download: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words]

scraped_articles = [{
    'headline_tokens': remove_stopwords(article['headline_tokens']),
    'body_tokens': remove_stopwords(article['body_tokens'])
} for article in scraped_articles]
```
Stemming and lemmatization
To consolidate different forms of words, we'll apply stemming and lemmatization using NLTK:
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer requires a one-time download: nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_tokens(tokens):
    return [stemmer.stem(token) for token in tokens]

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

for article in scraped_articles:
    article['headline_stemmed'] = stem_tokens(article['headline_tokens'])
    article['body_stemmed'] = stem_tokens(article['body_tokens'])
    article['headline_lemmatized'] = lemmatize_tokens(article['headline_tokens'])
    article['body_lemmatized'] = lemmatize_tokens(article['body_tokens'])
```
Term frequency analysis
We can now extract basic term statistics like frequency across the corpus:
```python
from collections import defaultdict
import pandas as pd

frequency = defaultdict(int)
for article in scraped_articles:
    for token in article['body_lemmatized']:
        frequency[token] += 1

term_freqs = pd.DataFrame(list(frequency.items()), columns=['term', 'frequency'])
term_freqs = term_freqs.sort_values(by='frequency', ascending=False)
```
This gives us a DataFrame showing the most common terms across all article bodies.
Text classification
Using these tokenized and normalized representations, we can train classifiers for tasks like topic labeling. Here is an example with scikit-learn's Naive Bayes, assuming each article also carries a topic label (for instance, the section it was scraped from):
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# `articles` holds one document string per article (the lemmatized body tokens
# joined back together); `labels` is assumed to hold the matching topic labels
articles = [' '.join(article['body_lemmatized']) for article in scraped_articles]

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(articles, labels, test_size=0.2)

# Extract bag-of-words features from the text data
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_vectors, y_train)

# Predict on the test set and calculate accuracy
X_test_vectors = vectorizer.transform(X_test)
predictions = clf.predict(X_test_vectors)
accuracy = np.mean(predictions == y_test)
print('Accuracy:', accuracy)
```
This workflow demonstrates how traditional techniques can unlock value from scraped data.
Conclusion
In summary, traditional NLP and modern LLMs both have important roles to play:
- Traditional techniques provide accessible and interpretable foundations for many NLP tasks.
- LLMs have driven dramatic progress through their ability to learn rich representations of language from massive corpora.
- Each approach has strengths that can compensate for the other's weaknesses.
For real-world applications, we're likely to see hybrid systems dominate – combining LLMs for representation learning with simpler traditional models for explainability and engineering benefits.
As data volumes grow and hardware improves, LLMs will continue advancing the horizons of what's possible. But traditional techniques will remain useful tools in every NLP practitioner's toolkit.