In this tutorial, we'll learn how to analyze large text datasets with LangChain and Python to find interesting data in anything from books to Wikipedia pages.

AI is such a big topic nowadays that OpenAI and libraries like LangChain barely need any introduction. However, in case you've been lost in an alternate dimension for the past year or so, LangChain, in a nutshell, is a framework for developing applications powered by language models, allowing developers to harness the power of LLMs and AI to analyze data and build their own AI apps.


Use Cases

Before getting into all the technicalities, I think it's good to look at some use cases of text dataset analysis using LangChain. Here are some examples:

  • Systematically extracting useful data from long documents.
  • Visualizing trends within a text or text dataset.
  • Making summaries for long and boring texts.


To follow along with this article, create a new folder and install LangChain and OpenAI using pip:

pip3 install langchain openai

File Reading, Text Splitting and Data Extraction

To analyze large texts, such as books, you need to split them into smaller chunks. That's because such texts contain hundreds of thousands to millions of tokens, and since no LLM can process that many tokens at once, there's no way to analyze them as a whole without splitting.
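If you want to check this for yourself, here's a quick side note of mine (not part of the original code): you can count a text's tokens with the tiktoken package, assuming the book file we'll use later in this tutorial:

import tiktoken

# gpt-3.5-turbo uses the cl100k_base encoding; counts vary by model
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

with open("texts/Beyond Good and Evil.txt", "r") as f:
    text = f.read()

print(len(enc.encode(text)))  # total token count of the book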

Also, instead of saving individual prompt outputs for each chunk of a text, it's more efficient to use a template for extracting data and putting it into a format like JSON or CSV.

In this tutorial, I'll be using JSON. Here is the book that I'm using for this example, which I downloaded for free from Project Gutenberg. This code reads the book Beyond Good and Evil by Friedrich Nietzsche, splits it into chapters, makes a summary of the first chapter, extracts the philosophical messages, ethical theories and moral principles presented in the text, and puts it all into JSON format.

As you can see, I used the "gpt-3.5-turbo-1106" model to work with larger contexts of up to 16000 tokens and a 0.3 temperature to give it a bit of creativity. You can experiment with the temperature and see what works best with your use case.

Note: the temperature parameter determines the freedom of an LLM to make creative and sometimes random answers. The lower the temperature, the more factual the LLM output, and the higher the temperature, the more creative and random the LLM output.
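For context, here's a minimal sketch of how the model and prompt might be set up before building the chain. The prompt wording here is my assumption, since the full original script isn't reproduced in this section:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Larger-context model with a slightly creative temperature, as described above
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0.3)

# Illustrative prompt; the exact wording may differ from the original script
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at extracting structured information from philosophical texts."),
    ("human", "Summarize this chapter and extract its philosophical messages, "
              "ethical theories and moral principles: {input}"),
])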

The extracted data gets put into JSON format using create_structured_output_chain and the provided JSON schema:

json_schema = {
    "type": "object",
    "properties": {
        "summary": {"title": "Summary", "description": "The chapter summary", "type": "string"},
        "messages": {"title": "Messages", "description": "Philosophical messages", "type": "string"},
        "ethics": {"title": "Ethics", "description": "Ethical theories and moral principles presented in the text", "type": "string"}
    },
    "required": ["summary", "messages", "ethics"],
}

chain = create_structured_output_chain(json_schema, llm, prompt, verbose=False)

The code then reads the text file containing the book and splits it by chapter. The chain is then given the first chapter of the book as text input:

f = open("texts/Beyond Good and Evil.txt", "r")
phi_text = str(f.read())
chapters = phi_text.split("CHAPTER")

# Run the extraction chain on the first chapter
print(chain.run(chapters[1]))

Here's the output of the code:

{'summary': 'The chapter discusses the concept of utilitarianism and its application in ethical decision-making. It explores the idea of maximizing overall happiness and minimizing suffering as a moral principle. The chapter also delves into the criticisms of utilitarianism and the challenges of applying it in real-world scenarios.', 'messages': 'The chapter emphasizes the importance of considering the consequences of our actions and the well-being of all individuals affected. It encourages thoughtful and empathetic decision-making, taking into account the broader impact on society.', 'ethics': 'The ethical theories presented in the text include consequentialism, hedonistic utilitarianism, and the principle of the greatest good for the greatest number.'}

Pretty cool. Philosophical texts written 150 years ago are quite hard to read and understand, but this code instantly translated the main points from the first chapter into an easy-to-understand report of the chapter's summary, message and ethical theories/moral principles. The flowchart below will give you a visual representation of what happens in this code.

Now you can do the same for all the chapters and put everything into a JSON file using this code.

I added time.sleep(20) as a comment, since it's possible that you'll hit rate limits when working with large texts, especially if you have the free tier of the OpenAI API. Since I think it's useful to know how many tokens and credits you're using with your requests so as not to accidentally drain your account, I also used with get_openai_callback() as cb: to see how many tokens and credits are used for each chapter.

This is the part of the code that analyzes every chapter and puts the extracted data for each in a shared JSON file:

for chi in range(1, len(chapters), 1):
    with get_openai_callback() as cb:
        ch = chain.run(chapters[chi])
        print(cb)  # shows the tokens and credits used for this chapter
    # time.sleep(20)  # uncomment this if you're hitting rate limits

    json_object = json.dumps(ch, indent=4)

    if chi == 1:
        with open("Beyond Good and Evil.json", "w") as outfile:
            outfile.write("[\n" + json_object + ",\n")
    elif chi < len(chapters) - 1:
        with open("Beyond Good and Evil.json", "a") as outfile:
            outfile.write(json_object + ",\n")
    else:
        with open("Beyond Good and Evil.json", "a") as outfile:
            outfile.write(json_object + "\n]")

The chi index starts at 1, because there's no chapter 0 before chapter 1. If the chi index is 1 (at the first chapter), the code writes (overwriting any existing content) the JSON data to the file while also adding an opening square bracket and new line at the beginning, and a comma at the end to follow JSON syntax. If chi isn't the minimum value (1) or maximum value (len(chapters)-1), the JSON data just gets added to the file along with a comma at the end. Finally, if chi is at its maximum value, the JSON gets added to the JSON file with a new line and a closing square bracket.

After the code finishes running, you'll see that Beyond Good and Evil.json is filled with the extracted information from all the chapters.

JSON file filled with data extracted from chapters

Here's a visual representation of how this code works.

Flowchart of how the code works with multiple chapters

Working with Multiple Files

If you have dozens of separate files that you'd like to analyze one by one, you can use a script similar to the one you've just seen, but instead of iterating through chapters, it will iterate through files in a folder.

I'll use the example of a folder filled with Wikipedia articles on the top 10 ranked tennis players (as of December 3, 2023) called top_10_tennis_players. You can download the folder here. This code will read each Wikipedia article, extract each player's age, height and fastest serve in km/h and put the extracted data into a JSON file in a separate folder called player_data.

Here's an example of an extracted player data file.

Daniil Medvedev extracted data
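The outer loop of such a script might look something like the sketch below. This is my own illustration rather than the full code linked above, and it assumes the folder names from this example:

import os
import json

for filename in os.listdir("top_10_tennis_players"):
    with open(os.path.join("top_10_tennis_players", filename), "r") as f:
        player_text = str(f.read())

    # ... split player_text into chunks, run the extraction chain on each,
    # and merge the results into ch[0], as shown in the snippet below ...

    # Write the merged data for this player to its own JSON file
    with open(os.path.join("player_data", filename.replace(".txt", ".json")), "w") as outfile:
        outfile.write(json.dumps(ch[0], indent=4))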

However, the full script isn't that simple (I wish it were). To efficiently and reliably extract the most accurate data from texts that are often too big to analyze without chunk splitting, I used this code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=16000,
    chunk_overlap=2000,
)
sub_texts = text_splitter.create_documents([player_text])

ch = []
for ti in range(len(sub_texts)):
    with get_openai_callback() as cb:
        ch.append(chain.run(sub_texts[ti]))
        print(cb)  # shows the tokens and credits used for this chunk

    for chi in range(1, len(ch), 1):
        # Keep the most recent age reported across chunks
        if (ch[chi]["age"] > ch[0]["age"]) or (ch[0]["age"] == "not found" and ch[chi]["age"] != "not found"):
            ch[0]["age"] = ch[chi]["age"]
        # Keep the fastest serve speed across chunks
        if (ch[chi]["serve"] > ch[0]["serve"]) or (ch[0]["serve"] == "not found" and ch[chi]["serve"] != "not found"):
            ch[0]["serve"] = ch[chi]["serve"]
        # Fill in the height if it wasn't found in the first chunk
        if (ch[0]["height"] == "not found") and (ch[chi]["height"] != "not found"):
            ch[0]["height"] = ch[chi]["height"]

In essence, this code does the following:

  1. It splits the text into chunks 16000 tokens in size, with a chunk overlap of 2000 to keep a bit of context.
  2. It extracts the required data from each chunk (see the schema sketch after this list).
  3. If the data extracted from the latest chunk is more relevant or accurate than that of the first chunk (or the value isn't found in the first chunk but is found in the latest chunk), it adjusts the values of the first chunk. For example, if chunk 1 says 'age': 26 and chunk 2 says 'age': 27, the age value gets updated to 27, since we need the player's latest age; or if chunk 1 says 'serve': 231 and chunk 2 says 'serve': 232, the serve value gets updated to 232, since we're looking for the fastest serve speed.
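For reference, the structured-output schema for the player data might look something like this, mirroring the book schema from earlier. The field names here are my assumption, based on the example output file:

player_schema = {
    "type": "object",
    "properties": {
        "age": {"title": "Age", "description": "The player's age", "type": "string"},
        "height": {"title": "Height", "description": "The player's height in cm", "type": "string"},
        "serve": {"title": "Serve", "description": "The player's fastest serve in km/h", "type": "string"},
    },
    "required": ["age", "height", "serve"],
}

chain = create_structured_output_chain(player_schema, llm, prompt, verbose=False)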

Here's how the whole code works in a flowchart.

Flowchart of how the code works with multiple files and chunks

Text to Embeddings

Embeddings are vector lists that are used to associate pieces of text with one another.

A big aspect of text analysis in LangChain is searching large texts for specific chunks that are relevant to a certain input or question.

We can return to the example of the Beyond Good and Evil book by Friedrich Nietzsche and make a simple script that takes a question on the text like "What are the flaws of philosophers?", turns it into an embedding, splits the book into chapters, turns the different chapters into embeddings and finds the chapter most relevant to the question, suggesting which chapter one should read to find an answer to this question as written by the author. You can find the code to do this here. This code specifically is what searches for the most relevant chapter for a given input or question:

embedded_question = embeddings_model.embed_query("What are the flaws of philosophers?")
similarities = []
tags = []
for i2 in range(len(emb_list)):
    similarities.append(cosine_similarity(emb_list[i2], embedded_question))
    tags.append(f"CHAPTER {i2}")


The embedding similarities between each chapter and the input get put into a list (similarities), and the number of each chapter gets put into the tags list. The most relevant chapter is then printed using print(tags[similarities.index(max(similarities))]), which gets the chapter number from the tags list that corresponds to the maximum value from the similarities list.
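The snippet above references embeddings_model, emb_list and cosine_similarity without showing where they come from. Here's one way they might be defined, as a sketch built on the earlier chapter-splitting code; the cosine_similarity helper in particular is my assumption:

from langchain.embeddings import OpenAIEmbeddings
import numpy as np

embeddings_model = OpenAIEmbeddings()

# Turn every chapter of the book into an embedding vector
emb_list = embeddings_model.embed_documents(chapters)

def cosine_similarity(v1, v2):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar
    v1, v2 = np.array(v1), np.array(v2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))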



Here's how this code works visually.

Flowchart of how the embedding code works

Other Application Ideas

There are many other analytical uses for large texts with LangChain and LLMs, and even though they're too complex to cover in this article in their entirety, I'll list some of them and outline how they can be achieved in this section.

Visualizing topics

You can, for example, take transcripts of YouTube videos related to AI, like the ones in this dataset, extract the AI-related tools mentioned in each video (LangChain, OpenAI, TensorFlow, and so on), compile them into a list, and find the overall most mentioned AI tools, or use a bar graph to visualize the popularity of each one.
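As a rough sketch of that visualization step (the tool list below is made-up placeholder data standing in for what the extraction chain would produce):

from collections import Counter
import matplotlib.pyplot as plt

# Placeholder data; in practice this list would come from the extraction chain
tools_mentioned = ["LangChain", "OpenAI", "LangChain", "TensorFlow", "OpenAI", "LangChain"]

counts = Counter(tools_mentioned)
names, frequencies = zip(*counts.most_common())

plt.bar(names, frequencies)
plt.title("Most mentioned AI tools")
plt.ylabel("Mentions")
plt.show()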

Analyzing podcast transcripts

You can take podcast transcripts and, for example, find similarities and differences between the different guests in terms of their opinions and sentiment on a given topic. You can also make an embeddings script (like the one in this article) that searches the podcast transcripts for the most relevant conversations based on an input or question.

Analyzing the evolution of news articles

There are plenty of large news article datasets out there, like this one on BBC news headlines and descriptions and this one on financial news headlines and descriptions. Using such datasets, you can analyze things like sentiment, topics and keywords for each news article. You can then visualize how these aspects of the news articles evolve over time.
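As a sketch of what that last step could look like (the dates and sentiment scores below are placeholders for what an extraction chain would produce):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: one sentiment score per article, with publication dates
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20", "2023-04-02"]),
    "sentiment": [0.2, -0.1, 0.4, 0.1],
})

# Average the sentiment per month to see how it evolves over time
monthly = df.set_index("date")["sentiment"].resample("M").mean()
monthly.plot(marker="o", title="Average news sentiment by month")
plt.ylabel("Sentiment score")
plt.show()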


I hope you found this helpful and that you now have an idea of how to analyze large text datasets with LangChain in Python using different methods like embeddings and data extraction. Best of luck with your LangChain projects!