How to Customize ChatGPT with LangChain, Pinecone, and Apify 💪

Hey friend! In this comprehensive guide, we're going to explore how you can extend the capabilities of ChatGPT using three awesome tools – LangChain, Pinecone, and Apify.

As you may know, ChatGPT has some limitations:

It's trained on data only up until 2021. So no current events or data!
It doesn't have any external memory or ability to look things up.
No continuity between conversations.

With LangChain, Pinecone, and Apify, we can overcome these limits to create a customized ChatGPT that:

Answers questions from real-time, up-to-date data (not just 2021 and earlier!).
Has rapid access to indexed knowledge bases and vectors.
Maintains conversational memory and context.

I'll walk you through how to set this up, step by step. Get ready to level up ChatGPT! 🚀

An Overview of LangChain, Pinecone, and Apify

Before we dig into the code, let me briefly explain what each of these tools does:

LangChain is an open source Python framework for building AI applications with large language models like GPT-3. It provides an easy way to integrate LLMs into your own apps!

Some key features:

Simple interfaces to connect with LLMs from OpenAI, Anthropic, Hugging Face, etc.
Tools to split up and process large amounts of text data.
Chains and Agents to build complex conversational systems.
Easy deployment to services like AWS Lambda and Google Cloud Run.

Pinecone is a managed vector database. It lets you efficiently index, store, and query millions of embeddings, which makes it a good fit for semantic search over text.

Why it's useful:

Fast approximate nearest-neighbor similarity search over stored embeddings.
Scales to very large collections of vectors (it stores embeddings, not the models that produce them).
Cloud hosted, so it handles the infrastructure and no DevOps is needed.
Supports upserts, so you can refresh vectors whenever your data changes.

Apify is a web automation and scraping platform. It provides pre-built scrapers, connectors, and tools to extract and structure data from websites.

Key features:

Grab data from any site with minimal code using Apify scrapers.
Schedule and run scrapers at scale on Apify's cloud infrastructure.
Output scraped data as JSON, CSVs, or export to databases.
Pre-built scrapers for many popular sites like Amazon, Airbnb, etc.

Now that we've covered what each tool does, let's see how we can combine them…

Set Up the Environment

I'll be using Python for this tutorial, so you'll want to set up a Python environment if you don't have one already.

Let's install the needed packages:

pip install langchain==0.0.189 "pinecone-client<3" "openai<1" tiktoken nest_asyncio apify-client chromadb

We'll also need API keys for OpenAI and Apify:

import os

os.environ["OPENAI_API_KEY"] = "sk-..." # Add your real OpenAI key here

os.environ["APIFY_API_TOKEN"] = "kJ..." # Add your real Apify token here

And we're ready to go!
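Before running anything that costs money, it can help to check that the keys are really set. Here is a minimal sketch using only the standard library; the helper name missing_env_vars and the REQUIRED list are my own illustration, not part of any of the three tools:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Variable names used in this tutorial; extend the list if you add more services.
REQUIRED = ["OPENAI_API_KEY", "APIFY_API_TOKEN"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict as environ makes the helper easy to try out without touching your real environment.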

Scraping Data with Apify

To start, we need some real-time data to feed to our LLM. That's where Apify comes in!

Apify has a huge library of ready-made scrapers for extracting data from popular sites. We'll use the Airbnb Scraper to grab up to 500 Airbnb listings from New York City.

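import json

from langchain.docstore.document import Document
from langchain.utilities import ApifyWrapper

apify = ApifyWrapper()

loader = apify.call_actor(
  actor_id = "dtrungtin/airbnb-scraper",
  run_input = {
    "currency": "USD",
    "maxListings": 500,
    "locationQuery": "New York, NY"
  },
  # call_actor needs a function that turns each scraped item into a Document.
  # The "url" field name is an assumption about this actor's output.
  dataset_mapping_function = lambda item: Document(
    page_content=json.dumps(item),
    metadata={"source": item.get("url", "")}
  )
)

It'll take a minute or two to run. Once finished, call_actor returns a dataset loader, and calling load() on it gives you the scraped data as a list of Document objects.

Let's take a peek at one document:

print(loader.load()[0])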

{
  "page_content": "{\"name\": \"Charming bedroom in Astoria\", ...}",
  "metadata": {
    "source": "https://www.airbnb.com/rooms/123" 
  }
}

Perfect – we have 500 up-to-date Airbnb listings ready to fuel our AI!
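Each document's page_content is just the scraped item serialized as a JSON string. Full listings can be large, so it may be worth keeping only the fields you care about before embedding them. Here is a rough sketch in plain Python, assuming each scraped item is a dict; the helper name listing_to_document and the "url" and "numberOfGuests" fields are hypothetical and only there for illustration:

```python
import json


def listing_to_document(item, keep_fields):
    """Build a {page_content, metadata} dict from one raw scraper item."""
    # Keep only the requested fields that are actually present in the item.
    trimmed = {key: item[key] for key in keep_fields if key in item}
    return {
        "page_content": json.dumps(trimmed, ensure_ascii=False),
        "metadata": {"source": item.get("url", "")},
    }


# A made-up item; the real scraper output has many more fields.
raw_item = {
    "name": "Charming bedroom in Astoria",
    "url": "https://www.airbnb.com/rooms/123",
    "numberOfGuests": 2,
}

doc = listing_to_document(raw_item, keep_fields=("name",))
print(doc["page_content"])  # {"name": "Charming bedroom in Astoria"}
```

Smaller documents mean fewer tokens to embed and less noise for the LLM to wade through later.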

Key fact: Apify has over 300 ready-made scrapers for sites like Google, Twitter, Amazon, and more!

Processing the Data with LangChain

Now we'll use LangChain to process this scraped data and get it ready for our LLM.

First, let's load the scraped items into a list of Document objects:

docs = loader.load()

Next, we'll split the data into smaller chunks of at most 1,000 characters using LangChain's RecursiveCharacterTextSplitter (chunk_size is measured in characters by default, not tokens):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
  chunk_size=1000
)

docs_chunks = splitter.split_documents(docs)

Smaller chunks are easier to embed and retrieve individually, and several of them can fit into the LLM's context window at once.

We can take a look at the first chunk to verify it worked:

print(docs_chunks[0])

Perfect, we now have properly formatted data ready for our LLM!
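If you're curious what chunking boils down to, here is a simplified sketch in plain Python. It cuts text into fixed-size pieces with a small overlap. This is a stand-in I wrote for illustration, not LangChain's algorithm: the real RecursiveCharacterTextSplitter also tries to break on separators like paragraphs and sentences first.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split `text` into pieces of at most `chunk_size` characters.

    Consecutive pieces share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = chunk_text("a" * 2500, chunk_size=1000, overlap=100)
print([len(piece) for piece in pieces])  # [1000, 1000, 700]
```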

Indexing Embeddings with Pinecone

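To quickly find relevant listings based on search queries, we need to index text embeddings of each listing in a vector database. That's where Pinecone comes in!

First we'll import Pinecone and initialize the client with an API key and environment, both of which you can find in the Pinecone console:

import pinecone

pinecone.init(api_key="abc123...", environment="...") # Add your real Pinecone key and environment here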

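Next, we create a new index (if it doesn't exist yet) and start ingesting our documents:

from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

# from_documents expects the index to exist already, so create it first.
# 1536 is the vector size of OpenAI's text-embedding-ada-002 model.
if "airbnb-index" not in pinecone.list_indexes():
  pinecone.create_index("airbnb-index", dimension=1536, metric="cosine")

index = Pinecone.from_documents(
  docs_chunks,
  OpenAIEmbeddings(),
  index_name="airbnb-index"
)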

We're using OpenAI's text embedding model to vectorize each chunk of text before sending it to Pinecone.

This will take a few minutes to fully ingest all vectorized documents. In a later session you can skip the ingestion and connect to the existing index instead:

index = Pinecone.from_existing_index(
  index_name="airbnb-index",
  embedding=OpenAIEmbeddings()
)

Now we have our Airbnb knowledge base indexed in Pinecone, ready for fast similarity search!
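To make "similarity search" a bit more concrete, here is a toy sketch in plain Python. It compares a query vector against a tiny in-memory "index" using cosine similarity and returns the closest matches. The three-dimensional vectors and listing IDs are made up for illustration; real embeddings have many more dimensions, and Pinecone uses approximate search rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (id, score) pairs most similar to `query_vector`."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for listing embeddings.
toy_index = {
    "astoria-room": [0.9, 0.1, 0.0],
    "brooklyn-loft": [0.1, 0.9, 0.2],
    "midtown-suite": [0.0, 0.2, 0.9],
}

results = top_k([0.2, 0.8, 0.1], toy_index, k=2)
print([doc_id for doc_id, _ in results])  # ['brooklyn-loft', 'astoria-room']
```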

Building a Chatbot with LangChain

We have our data scraped, processed, and indexed. Now let's use LangChain to build a conversational chatbot!

We'll use ChatGPT via the ChatOpenAI class:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

And create a RetrievalQA chain that can query our Pinecone index. The chain expects a retriever, so we wrap the vector store with as_retriever():

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=index.as_retriever()
)

Let's give it a test run:

query = "What is the most affordable place in Brooklyn?"

print(qa.run(query))

The chain retrieves the chunks most similar to the question from our indexed Airbnb data and has the LLM answer based on them!
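Under the hood, a retrieval chain of this kind mostly just pastes the retrieved chunks into a prompt ahead of the question. Here is a rough sketch of that idea in plain Python. The function name build_stuff_prompt and the prompt wording are my own assumptions for illustration, not LangChain's actual template:

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
retrieved = [
    '{"name": "Charming bedroom in Astoria"}',
    '{"name": "Sunny loft in Brooklyn"}',
]

prompt = build_stuff_prompt("Which listing is in Brooklyn?", retrieved)
print(prompt)
```

Because everything has to fit into one prompt, the number and size of the retrieved chunks directly limits how much data the LLM sees per question.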

To make our chatbot even more advanced, we can use…

Leveraging LangChain Agents

LangChain provides pre-built Agents that make it easy to build sophisticated conversational AI systems.

Agents have capabilities like:

Deciding which tool to use at each step
Conversation history (when you attach a memory object)
Orchestrating toolchains
Clarifying questions

We just need to define the different tools our agent can leverage:

from langchain.agents import Tool, initialize_agent

tools = [
  Tool(name="Airbnb KB",
        func=qa.run,
        description="Find Airbnb listings")  # the agent reads this description to decide when to use the tool
]

agent = initialize_agent(tools=tools, llm=llm, verbose=True)

Now we can chat with our agent by passing a question to agent.run(). A conversation might look like this:

Human: What are the top rated airbnbs in Manhattan?

Agent: Based on the indexed Airbnb data, here are the top 3 rated listings in Manhattan:

1. Luxury Condo in the Heart of Manhattan (5 star rating)
2. Cozy East Village Sanctuary (4.9 star rating)  
3. Bright Suite in Midtown (4.8 star rating)
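One thing to keep in mind: remembering earlier turns of a conversation generally means storing them somewhere and feeding them back into the prompt. Here is a minimal sketch of that idea in plain Python. The ConversationBuffer class is a hypothetical stand-in I wrote for illustration, not LangChain's memory implementation:

```python
from collections import deque


class ConversationBuffer:
    """Keep the most recent turns of a chat and render them as prompt text."""

    def __init__(self, max_turns=5):
        # A bounded deque silently drops the oldest turn once it is full.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, human, ai):
        self.turns.append((human, ai))

    def render(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory = ConversationBuffer(max_turns=2)
memory.add_turn("Hi", "Hello!")
memory.add_turn("Find me a place in Manhattan", "Here are three listings...")
memory.add_turn("Which one is cheapest?", "The second one.")

print(memory.render())
```

Capping the number of turns keeps the prompt from growing without bound, at the cost of forgetting the oldest parts of the chat.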

And that's it! With just a few lines of code, we leveraged Apify, LangChain and Pinecone to build a conversational AI agent powered by up-to-date real world data!

Let's do a quick recap:

Apify scraped 500 Airbnb listings from NYC
LangChain processed the data and handled the LLM integration
Pinecone indexed the listings for fast similarity search
We built a chatbot that can answer questions on current Airbnb data!

Key Takeaways and Next Steps

The ability to customize and extend ChatGPT is incredibly powerful. Here are some key lessons:

Apify provides an easy way to scrape niche datasets from the web.
LangChain handles all the complexity of data processing and LLM integration.
Pinecone enables embedding indexes at scale for production use cases.
Together, they unlock new possibilities with LLMs!

Of course, this just scratches the surface of what can be built. Here are some next steps to explore:

Scrape more diverse datasets – Apify has 300+ scrapers ready to use.
Scale your Pinecone index as your dataset grows, and use namespaces to keep separate datasets apart.
Build more advanced conversational agents using LangChain's toolchains and agents.
Deploy your agent as a web API with AWS Lambda or Google Cloud Run.
Create a front-end chat interface with Streamlit, React, etc.

I hope you found this guide helpful! Let me know if you have any other questions. I'm always happy to chat more about LangChain, Pinecone, Apify or building amazing things with AI.

