Reddit Sentiment to Pinecone and LangChain

A practical pattern for converting subreddit posts into compact sentiment documents with source URLs, then loading them into LangChain and Pinecone.

2026-05-20 · 7 min read

What this pipeline does

This guide turns Reddit posts into retrieval documents that can power product research, buyer objection mining, and market sentiment agents. The HarvestLab Reddit actor returns source URLs, engagement metrics, author context, and post text in a flat schema.

The pipeline extracts posts from selected communities, maps each post into a compact document, embeds the title and body text, and stores the resulting vectors and metadata in Pinecone through LangChain.
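Not every post carries text worth embedding: link-only posts have an empty body, and removed or deleted posts carry a placeholder instead. The sketch below shows one way to prepare the text before embedding. It assumes the actor exposes `title` and `selftext`, as the mapping later in this guide does; `PostText`, `usefulBody`, `prepareText`, `isEmbeddable`, and the 4,000-character cap are illustrative names and values, not part of HarvestLab or LangChain.

```typescript
// Hypothetical subset of the actor's flat schema (assumption).
interface PostText {
  title: string;
  selftext?: string | null;
}

// Body text worth embedding, or "" when the post is link-only,
// removed, or deleted (Reddit substitutes these placeholders).
function usefulBody(item: PostText): string {
  const body = (item.selftext ?? "").trim();
  return body === "[removed]" || body === "[deleted]" ? "" : body;
}

// Joins title and body, collapses runs of spaces and blank lines,
// and caps the length. maxChars = 4000 is an arbitrary cap, not a model limit.
function prepareText(item: PostText, maxChars = 4000): string {
  const parts = [item.title.trim(), usefulBody(item)].filter((p) => p.length > 0);
  const text = parts
    .join("\n\n")
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n");
  return text.slice(0, maxChars);
}

// Items with neither a title nor a usable body are skipped.
function isEmbeddable(item: PostText): boolean {
  return prepareText(item).length > 0;
}
```

Filtering with `items.filter(isEmbeddable)` before the mapping step avoids paying for embeddings of empty documents.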

Document shape

{
  "id": "reddit:LocalLLaMA:abc123",
  "text": "Title and post body prepared for embedding.",
  "metadata": {
    "subreddit": "LocalLLaMA",
    "score": 144,
    "num_comments": 38,
    "source_url": "https://reddit.com/r/LocalLLaMA/comments/abc123",
    "captured_at": "2026-05-20T18:00:00Z"
  }
}
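A minimal sketch of a builder for this shape follows. It assumes each flat item carries a post `id` plus the fields used elsewhere in this guide; `FlatRedditPost`, `SentimentDocument`, and `toSentimentDocument` are hypothetical names introduced for illustration.

```typescript
// Hypothetical flat item; field names mirror this guide, plus an assumed post `id`.
interface FlatRedditPost {
  id: string; // assumed: the post id, e.g. "abc123"
  subreddit: string;
  title: string;
  selftext?: string | null;
  score: number;
  num_comments: number;
  permalink: string; // relative ("/r/...") or absolute
}

interface SentimentDocument {
  id: string;
  text: string;
  metadata: {
    subreddit: string;
    score: number;
    num_comments: number;
    source_url: string;
    captured_at: string;
  };
}

function toSentimentDocument(post: FlatRedditPost, capturedAt: Date): SentimentDocument {
  return {
    // Deterministic id, so an upsert keyed on it replaces the earlier copy
    // rather than adding a duplicate.
    id: `reddit:${post.subreddit}:${post.id}`,
    text: `${post.title}\n\n${post.selftext ?? ""}`.trim(),
    metadata: {
      subreddit: post.subreddit,
      score: post.score,
      num_comments: post.num_comments,
      // Resolves a relative permalink; an absolute URL passes through unchanged.
      source_url: new URL(post.permalink, "https://reddit.com").href,
      // Drop milliseconds to match the "2026-05-20T18:00:00Z" form shown above.
      captured_at: capturedAt.toISOString().replace(/\.\d{3}Z$/, "Z"),
    },
  };
}
```

Passing `capturedAt` in, rather than calling `new Date()` inside the function, keeps every document from one run on the same timestamp and makes the mapping easy to test.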

LangChain loading pattern

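// Map each flat Reddit item to a LangChain-style document.
// Metadata keys match the document shape above.
const documents = redditItems.map((item) => ({
  // Title plus body; trim() drops the trailing blank lines of link-only posts.
  pageContent: `${item.title}\n\n${item.selftext ?? ""}`.trim(),
  metadata: {
    subreddit: item.subreddit,
    score: item.score,
    num_comments: item.num_comments,
    // Resolves a relative permalink to a full URL; an absolute one passes through unchanged.
    source_url: new URL(item.permalink, "https://reddit.com").href,
  },
}));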
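Once the documents exist, writing them in batches keeps each embedding and upsert request small. The sketch below assumes a store exposing `addDocuments(docs, { ids })`, which is modelled on the method LangChain vector stores such as `PineconeStore` provide; confirm the exact options against the version you install. `LoadableDoc`, `BatchStore`, `loadInBatches`, and the default batch size of 100 are illustrative.

```typescript
interface LoadableDoc {
  pageContent: string;
  metadata: Record<string, string | number | boolean>;
}

// Minimal structural stand-in for a vector store (assumption, see the lead-in).
interface BatchStore {
  addDocuments(docs: LoadableDoc[], options?: { ids?: string[] }): Promise<unknown>;
}

// Sends documents batch by batch, in order, and returns how many were written.
async function loadInBatches(
  store: BatchStore,
  docs: LoadableDoc[],
  ids: string[],
  batchSize = 100
): Promise<number> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  if (ids.length !== docs.length) throw new Error("ids and docs must have the same length");
  let written = 0;
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    // Each batch is awaited before the next starts, so a failure stops the run early.
    await store.addDocuments(batch, { ids: ids.slice(start, start + batchSize) });
    written += batch.length;
  }
  return written;
}
```

Sequential batches are slower than parallel ones but make rate limits and partial failures easier to reason about; the returned count tells the caller how far a run got.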

Retrieval notes

Keep the raw Reddit URL in metadata so the agent can cite the original thread. Store engagement metrics beside the embedded text so downstream ranking can prefer high-signal discussions over low-activity posts.
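One way to prefer high-signal discussions is to blend the similarity score with a damped engagement score after retrieval. The sketch below assumes results arrive as `[document, similarity]` pairs, the shape returned by LangChain's `similaritySearchWithScore`, and that a higher similarity means a closer match. The 0.8 similarity weight and the double weighting of comments are arbitrary starting points, not tuned values; `RetrievedDoc`, `engagement`, `rerankByEngagement`, and `cite` are hypothetical helpers.

```typescript
interface RetrievedDoc {
  pageContent: string;
  metadata: { score: number; num_comments: number; source_url: string };
}

type Scored = [RetrievedDoc, number]; // [document, similarity]

// Log-damped engagement, so one very large thread does not drown out the rest.
// Comments count double here; that weighting is an assumption.
function engagement(doc: RetrievedDoc): number {
  const raw = Math.max(0, doc.metadata.score) + 2 * Math.max(0, doc.metadata.num_comments);
  return Math.log(1 + raw);
}

// Blends similarity with engagement normalised to [0, 1] across the result set.
function rerankByEngagement(results: Scored[], similarityWeight = 0.8): Scored[] {
  const maxEngagement = Math.max(1e-9, ...results.map((r) => engagement(r[0])));
  const blended = (r: Scored): number =>
    similarityWeight * r[1] + (1 - similarityWeight) * (engagement(r[0]) / maxEngagement);
  // slice() copies the array so the caller's ordering is left untouched.
  return results.slice().sort((a, b) => blended(b) - blended(a));
}

// Formats a citation line the agent can append to its answer.
function cite(doc: RetrievedDoc): string {
  return `${doc.metadata.source_url} (score ${doc.metadata.score}, ${doc.metadata.num_comments} comments)`;
}
```

Setting `similarityWeight` to 1 disables the engagement term, which is a quick way to compare the reranked order against plain similarity.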