In this tutorial, you'll learn how to use Python and the OpenAI API to carry out data mining and analysis on your data.
Manually analyzing datasets to extract useful data, or even using simple programs to do the same, can often get complicated and time consuming. Fortunately, with the OpenAI API and Python, it's possible to systematically analyze your datasets for interesting information without over-engineering your code and wasting time. This can be used as a universal solution for data analysis, eliminating the need to use different methods, libraries, and APIs to analyze different types of data and data points within a dataset.
Let's walk through the steps of using the OpenAI API and Python to analyze your data, starting with how to set things up.
Setup
To mine and analyze data through Python using the OpenAI API, install the openai and pandas libraries:

pip3 install openai pandas
Once you've done that, create a new folder and create an empty Python file inside it.
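Before any of the code in this tutorial can call the API, the openai library needs your API key. Here's a minimal sketch of that setup, assuming the key is stored in an environment variable named OPENAI_API_KEY (the variable name is my choice, not a requirement):

import os
import openai

# Read the API key from an environment variable instead of hardcoding it
openai.api_key = os.getenv("OPENAI_API_KEY")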
Analyzing Text Files
For this tutorial, I thought it would be interesting to make Python analyze Nvidia's latest earnings call.
Download the latest Nvidia earnings call transcript, which I got from The Motley Fool, and move it into your project folder.
Then open your empty Python file and add the analysis code, sketched below. The code reads the Nvidia earnings transcript that you've downloaded and passes it to the extract_info function as the transcript variable.
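The full script isn't reproduced here, so here's a minimal sketch of what it might look like; the transcript filename and the exact prompt wording are my assumptions:

import openai

def extract_info(prompt, text):
    # Send the prompt and the transcript to the model and return its reply
    completions = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[
            {"role": "user", "content": prompt + "\n\n" + text}
        ],
        temperature=0.3,
    )
    return completions.choices[0].message.content

# Read the downloaded transcript (assumed filename)
with open("nvidia_earnings_transcript.txt", "r") as f:
    transcript = f.read()

prompt = ("Extract the following information from the text:\n"
          "Nvidia's revenue\n"
          "What Nvidia did this quarter\n"
          "Remarks about AI")

print(extract_info(prompt, transcript))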
The extract_info function passes the prompt and transcript as the user input, along with temperature=0.3 and model="gpt-3.5-turbo-16k". The reason it uses the "gpt-3.5-turbo-16k" model is that it can process large texts, such as this transcript. The code gets the response using the openai.ChatCompletion.create endpoint and passes the prompt and transcript variables as the user input:
completions = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "user", "content": prompt + "\n\n" + text}
    ],
    temperature=0.3,
)
The full input will look like this:
Extract the following information from the text:
Nvidia's revenue
What Nvidia did this quarter
Remarks about AI

Nvidia earnings transcript goes here
Now, if we pass this input to the openai.ChatCompletion.create endpoint, the full output will look like this:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Actual response",
        "role": "assistant"
      }
    }
  ],
  "created": 1693336390,
  "id": "request-id",
  "model": "gpt-3.5-turbo-16k-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 579,
    "prompt_tokens": 3615,
    "total_tokens": 4194
  }
}
As you can see, it returns the text response as well as the token usage of the request, which can be useful if you're monitoring your expenses and optimizing your costs. But since we're only interested in the response text, we get it through the completions.choices[0].message.content response path.
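In code, pulling out just the text looks like this:

# Keep only the generated text from the API response
response_text = completions.choices[0].message.content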
If you run your code, you should get output similar to what's quoted below:
From the text, we can extract the following information:

- Nvidia's revenue: In the second quarter of fiscal 2024, Nvidia reported record Q2 revenue of $13.51 billion, which was up 88% sequentially and up 101% year on year.
- What Nvidia did this quarter: Nvidia experienced exceptional growth in various areas. They saw record revenue in their data center segment, which was up 141% sequentially and up 171% year on year. They also saw growth in their gaming segment, with revenue up 11% sequentially and 22% year on year. Additionally, their professional visualization segment saw revenue growth of 28% sequentially. They also announced partnerships and collaborations with companies like Snowflake, ServiceNow, Accenture, Hugging Face, VMware, and SoftBank.
- Remarks about AI: Nvidia highlighted the strong demand for their AI platforms and accelerated computing solutions. They mentioned the deployment of their HGX systems by major cloud service providers and consumer internet companies. They also discussed the applications of generative AI in various industries, such as marketing, media, and entertainment. Nvidia emphasized the potential of generative AI to create new market opportunities and boost productivity in different sectors.
As you can see, the code extracts the information specified in the prompt (Nvidia's revenue, what Nvidia did this quarter, and remarks about AI) and prints it.
Analyzing CSV Files
Analyzing earnings call transcripts and text files is cool, but to systematically analyze large volumes of data, you'll need to work with CSV files.
As a working example, download this Medium articles CSV dataset and put it into your project folder.
If you take a look inside the CSV file, you'll see that it has "author", "claps", "reading_time", "link", "title" and "text" columns. To analyze the Medium articles with OpenAI, you only need the "title" and "text" columns.
Create a new Python file in your project folder and paste in the CSV analysis code, which is sketched below.
This code is a bit different from the code we used to analyze a text file: it reads the CSV rows one by one, extracts the specified pieces of information, and adds them to new columns.
For this tutorial, I've picked a CSV dataset of Medium articles, which I got from HSANKESARA on Kaggle. This CSV analysis code will find the overall tone and the main lesson/point of each article, using the "title" and "text" columns of the CSV file. Since I always come across clickbaity articles on Medium, I also thought it would be interesting to have it rate how "clickbaity" each article is, giving each one a "clickbait score" from 0 to 3, where 0 is no clickbait and 3 is extreme clickbait.
Before I explain the code, note that analyzing the entire CSV file would take too long and cost too many API credits, so for this tutorial I've made the code analyze only the first five articles, using df = df[:5].
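The full CSV script isn't shown here either, so here's a sketch of the setup that the loop below relies on. The dataset filename and prompt wording are my assumptions, and note that in this version the prompt is baked into extract_info, since the loop calls it with a single argument:

import openai
import pandas as pd

def extract_info(text):
    # Hypothetical prompt; it asks for the three pieces of information,
    # each separated by a blank line in the model's answer
    prompt = ("Extract the following information from the text:\n"
              "Overall tone\n"
              "Main lesson or point\n"
              "Clickbait score from 0 (no clickbait) to 3 (extreme clickbait)")
    completions = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt + "\n\n" + text}],
        temperature=0.3,
    )
    return completions.choices[0].message.content

df = pd.read_csv("articles.csv")  # assumed filename
df = df[:5]  # analyze only the first five articles

titles = df["title"].tolist()
articles = df["text"].tolist()

# One list per extracted field, filled in by the loop below
apa1, apa2, apa3 = [], [], []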
You may be confused about the following part of the code, so let me explain:
for di in range(len(df)):
    title = titles[di]
    abstract = articles[di]
    additional_params = extract_info('Title: ' + str(title) + '\n\n' + 'Text: ' + str(abstract))
    try:
        result = additional_params.split("\n\n")
    except:
        result = []
This code iterates over all the articles (rows) in the CSV file and, with each iteration, gets the title and body of each article and passes them to the extract_info function, which we saw earlier. It then turns the response of the extract_info function into a list, to separate the different pieces of information, using this code:
try:
    result = additional_params.split("\n\n")
except:
    result = []
Next, it adds each piece of information to a list, and if there's an error (such as a missing value), it adds "No result" to the list:
try:
    apa1.append(result[0])
except Exception as e:
    apa1.append('No result')
try:
    apa2.append(result[1])
except Exception as e:
    apa2.append('No result')
try:
    apa3.append(result[2])
except Exception as e:
    apa3.append('No result')
Finally, after the for loop is finished, the lists containing the extracted information are inserted into new columns in the CSV file:
df = df.assign(Tone=apa1)
df = df.assign(Main_lesson_or_point=apa2)
df = df.assign(Clickbait_score=apa3)
As you can see, it adds the lists to new CSV columns named "Tone", "Main_lesson_or_point" and "Clickbait_score".
It then writes the updated data back to the CSV file with index=False:
df.to_csv("data.csv", index=False)
The reason you have to specify index=False is to avoid creating new index columns every time you write new columns to the CSV file.
Now, if you run your Python file, wait for it to finish, and check the resulting CSV file in a CSV file viewer, you'll see the new columns, as pictured below.
If you run your code multiple times, you'll notice that the generated answers differ slightly. That's because the code uses temperature=0.3 to add a bit of creativity to its answers, which is useful for subjective topics like clickbait.
Working with Multiple Files
If you want to automatically analyze multiple files, you first need to put them inside a folder and make sure the folder only contains the files you're interested in, to prevent your Python code from reading irrelevant files. Then, import Python's built-in glob module with import glob. (Since glob ships with Python's standard library, there's nothing extra to install.)
In your Python file, use this code to get a list of all the files in your data folder:
data_files = glob.glob("data_folder/*")
Then put the code that does the analysis inside a for loop:
for i in range(len(data_files)):
Inside the for loop, read the contents of each file. For text files, that looks like this:
# glob already returns paths that include the folder, so use them directly
f = open(data_files[i], "r")
txt_data = f.read()
And like this for CSV files:
df = pd.read_csv(data_files[i])
In addition, make sure to save the output of each file's analysis to a separate file, using something like this:
df.to_csv(f"output_folder/data{i}.csv", index=False)
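Putting these pieces together, a sketch of the whole multi-file loop for CSV files might look like this, with data_folder and output_folder as assumed folder names and the per-row analysis left as a placeholder:

import glob
import pandas as pd

# Collect every file path inside the data folder
data_files = glob.glob("data_folder/*")

for i in range(len(data_files)):
    # glob returns paths that already include the folder prefix
    df = pd.read_csv(data_files[i])

    # ... run the per-row analysis from the previous section here ...

    # Write each file's results to its own output file
    df.to_csv(f"output_folder/data{i}.csv", index=False)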
Conclusion
Remember to experiment with your temperature parameter and adjust it for your use case. If you want the AI to give more creative answers, increase the temperature; if you want it to give more factual answers, lower it.
The combination of OpenAI and Python for data analysis has many applications beyond article and earnings call transcript analysis. Examples include news analysis, book analysis, customer review analysis, and much more! That said, when testing your Python code on big datasets, make sure to test it on only a small part of the full dataset first, to save API credits and time.