An in-depth guide to web scraping using ChatGPT Code Interpreter and its plugins.
Unless you’re creating something entirely new, chances are you need some prerequisite data to start. Or you may want to look into the competition for valuable input. In addition, there could be numerous reasons for someone to be interested in a specific website’s content.
Web scraping is the process that serves such use cases.
And there are multiple ways to go about it. There are heavyweight tools you can subscribe to for professional scraping of big websites. Alternatively, you may require a specific setup for on-premise processing.
Either way, the process is expensive, time-consuming, and tedious for beginners, especially for scraping just a few web pages.
Overview of ChatGPT for Web Scraping
I don’t need to introduce ChatGPT to you, do I?
In short, ChatGPT is a generative AI that responds like a human. You get a chat interface for asking it to complete various tasks, such as inquiring about historical events, writing essays, summarizing, translating, coding, etc.
ChatGPT replies in text. However, there are ChatGPT plugins that enhance its capabilities in many ways, and we’ll be using one such plugin. In addition, we’ll use its Code Interpreter for scraping websites with complicated webpage structures or active anti-scraping protocols.
Please note that ChatGPT has free and paid versions. You’ll need the paid subscription (currently $20 a month) to use the web scraper plugin or the Code Interpreter engine.
In the following sections, I’ll illustrate the process step by step.
Disclaimer: Before proceeding, please check that the subject website permits scraping of its content. If not, you can contact its admin and ask for permission to avoid any legal trouble.
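One quick, partial way to do that check is to read the site’s robots.txt. The sketch below uses Python’s standard urllib.robotparser module; the helper name is_allowed and the sample rules are made up for illustration. Keep in mind that robots.txt is only one signal and does not replace reading the website’s terms.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

For a live site, you would first download the robots.txt file from the site’s root and pass its text to the same function.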
Web Scraping Using ChatGPT Plugin
Log in to your OpenAI account, hover over GPT-4 (its current paid version), and click Plugins.
Next, click No plugins enabled, scroll down, and click Plugin store.
Please note that instead of No plugins enabled, you’ll see a plugin icon if one is active. In that case, click that icon to open the drop-down and click Plugin store at the bottom.
This will open the Plugin store. Search for Scraper and hit Install.
Select this plugin in the ChatGPT interface.
Once it is selected, prompt ChatGPT, mentioning the subject URL and the content to scrape.
I’ve done this for a few websites. Check them out.
Scraping a Publication
We are a tech-focused publication, and I’ve chosen our home page, geekflare.com/, for this illustration.
Here’s the prompt:
check this webpage: https://geekflare.com/ and prepare a table indicating the article title, author, publication date, and excerpt for the top 10 articles.
You can also re-prompt to convert the data into CSV format, paste it into a text file with a .csv extension, and open it in a spreadsheet application like MS Excel.
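If you would rather build the CSV yourself from the table ChatGPT returns, a minimal sketch with Python’s standard csv module could look like this. The column names follow the prompt above; the sample row and the rows_to_csv helper are invented placeholders, not real scraped data.

```python
import csv
import io


def rows_to_csv(rows: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialize a list of row dictionaries into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


FIELDS = ["title", "author", "date", "excerpt"]

# Placeholder row for illustration only; not actual content from any website.
SAMPLE_ROWS = [
    {
        "title": "Example article",
        "author": "Jane Doe",
        "date": "2023-01-01",
        "excerpt": "Short, made-up excerpt",
    },
]

if __name__ == "__main__":
    print(rows_to_csv(SAMPLE_ROWS, FIELDS))
```

The csv module quotes any field that contains a comma, which is easy to get wrong when pasting text into a .csv file by hand.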
Scraping a Deal or Coupon Webpage
The Geekflare deals section is where we have handpicked some offers on top tech products. How about fetching every deal in a tabular format?
Prepare a list of deals from this webpage: https://geekflare.com/deals/. Present the result in a tabular format.
Summarize in tabular format the latest news from the "in the news" section from this Wikipedia page: https://en.wikipedia.org/wiki/Main_Page
Scraping E-commerce Stores
Lastly, I tried scraping Amazon.com for laptops by applying a few filters and feeding the URL to ChatGPT. This is what I got:
The problem is that this isn’t an isolated case. You’ll find many such instances where websites have anti-scraping measures. In that case, you’ll need to find an alternative for getting the data if subscribing to industry-standard scrapers isn’t an option.
The following section describes one such solution.
Web Scraping Using ChatGPT Code Interpreter
Code Interpreter is a newly launched ChatGPT engine that caters to programming-related tasks. While the default engine relies heavily on text responses, Code Interpreter can help visualize outputs, parse, debug, and execute code, integrate with software binaries, and do many more programming-centric things.
In this process, we’ll download the source HTML, upload it to ChatGPT Code Interpreter, and proceed with the scraping.
I’ve taken this web page for extraction:
We will start by saving the webpage as HTML. For that, go to the webpage and press
Now we have the file for scraping. Let’s decide on the prompt.
In addition to the text prompt, you can see I’ve given it sample elements to fast-track the scraping. Since Amazon’s web page structures are complex, the scraping attempt might fail or return nothing without these samples.
And getting these elements is fairly easy. Right-click anywhere on the subject webpage and click Inspect from the pop-up menu.
First, click the topmost icon (marked as 1). This will highlight the details while you select elements on the page. Next, select the container element for any specific product.
Please make sure to select the innermost container. You can hover along, and it will keep highlighting. The moment you get the last shell covering that block, you can click and go over to the right side to copy the element’s
Similarly, select the samples for the other elements.
Finally, upload the HTML and use a prompt similar to this:
look at this webpage html and extract the laptop titles, price, and ratings. present the result in a tabular format within this chat interface and also give the results in a CSV to download. div class="s-card-container s-overflow-hidden aok-relative puis-include-content-margin puis puis-vfcg1duwvmpo42mcln9ojhiljk s-latency-cf-section s-card-border" sample title element: span class="a-size-medium a-color-base a-text-normal" sample price element: span class="a-price-whole" sample ratings element: span class="a-size-base puis-bold-weight-text"
This will take some time while ChatGPT Code Interpreter does its work. The chat may show only some of the details, while everything will be in the attached CSV file.
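The code that Code Interpreter actually runs is not shown here. As a rough idea of what such an extraction can look like, here is a sketch using only Python’s standard html.parser module. The class names come from the prompt above; the ProductParser class, the extract_products helper, and the tiny sample HTML are assumptions made for illustration, and real Amazon markup is far messier (nested containers, for instance, are not handled).

```python
from html.parser import HTMLParser

CONTAINER_CLASS = "s-card-container"
FIELD_CLASSES = {
    "title": {"a-size-medium", "a-color-base", "a-text-normal"},
    "price": {"a-price-whole"},
    "rating": {"a-size-base", "puis-bold-weight-text"},
}


class ProductParser(HTMLParser):
    """Collect one row per product container, keeping the first match per field."""

    def __init__(self):
        super().__init__()
        self.products = []    # finished rows
        self._current = None  # row being filled, or None outside a container
        self._div_depth = 0   # <div> nesting depth inside the container
        self._field = None    # field currently being captured
        self._span_depth = 0  # <span> nesting depth inside the captured span

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "div":
            if self._current is None and CONTAINER_CLASS in classes:
                self._current = {name: "" for name in FIELD_CLASSES}
                self._div_depth = 1
            elif self._current is not None:
                self._div_depth += 1
        elif tag == "span" and self._current is not None:
            if self._field is not None:
                self._span_depth += 1
                return
            for name, wanted in FIELD_CLASSES.items():
                if wanted <= classes and not self._current[name]:
                    self._field, self._span_depth = name, 1
                    break

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._span_depth -= 1
            if self._span_depth == 0:
                self._field = None
        elif tag == "div" and self._current is not None:
            self._div_depth -= 1
            if self._div_depth == 0:
                self.products.append(self._current)
                self._current, self._field = None, None

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data.strip()


def extract_products(html: str) -> list[dict[str, str]]:
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


# Made-up markup that only mimics the class names from the prompt.
SAMPLE_HTML = """
<div class="s-card-container s-card-border">
  <div>
    <span class="a-size-medium a-color-base a-text-normal">Example Laptop 15</span>
    <span class="a-price-whole">799</span>
    <span class="a-size-base puis-bold-weight-text">4.5</span>
  </div>
</div>
<div class="s-card-container s-card-border">
  <span class="a-size-medium a-color-base a-text-normal">Example Laptop 17</span>
  <span class="a-price-whole">1,099</span>
</div>
"""

if __name__ == "__main__":
    for row in extract_products(SAMPLE_HTML):
        print(row)
```

To try it on a saved page, read the HTML file into a string and pass it to extract_products. A missing element simply leaves that field empty, which is one reason the output needs a manual check.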
You can observe that the table has a few entries not present on the original web page, especially at the beginning. In such cases, you need to double-check and clean the data for any redundancies.
If there are any, you can re-prompt ChatGPT to get a clean CSV.
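If you prefer to tidy the file yourself, the sketch below drops rows without a title and removes exact duplicates. The clean_rows helper and the sample rows are hypothetical; entries that look valid but never appeared on the page still need a manual comparison against the original.

```python
def clean_rows(rows: list[dict[str, str]], required: str = "title") -> list[dict[str, str]]:
    """Drop rows missing the required field and remove exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row.get(required, "").strip():
            continue  # skip rows with no usable title
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip exact duplicates of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up rows for illustration only.
SAMPLE_ROWS = [
    {"title": "", "price": "", "rating": ""},
    {"title": "Example Laptop 15", "price": "799", "rating": "4.5"},
    {"title": "Example Laptop 15", "price": "799", "rating": "4.5"},
    {"title": "Example Laptop 17", "price": "1,099", "rating": ""},
]

if __name__ == "__main__":
    for row in clean_rows(SAMPLE_ROWS):
        print(row)
```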
ChatGPT does many things, and basic web scraping is one of them. Agreed, it won’t be suitable for someone scraping hundreds of pages. Still, it’ll get you started in the right direction and is ideal for a short scraping session.
In this guide, we have used one of its scraping plugins and Code Interpreter. While plugins work on many standard websites, the second method is for custom webpage structures or pages with dynamic elements (infinite scroll, read more, etc.).
And to reiterate, go through the subject website’s terms before scraping.
PS: Check out these cloud scraping solutions and our own Geekflare scraping API.