Background
A popular use case in RPA is a process that extracts structured data from a website. For example, we may want to extract the results of a Google search, product data from an online shop (across multiple pages), and much more. These processes have the following things in common:
- We work with structured data. That is, we know what we want to extract - for example, prices, links, and ratings - and we know where these fields are located.
- We want to do it on a regular basis.
The question is how to build solutions for such cases. Before the summer release, it was possible to build them with our ElectroNeek platform, but it was always a bit nontrivial, and here's why (for simplicity, let's set aside the goal of extracting data from multiple pages and focus on extracting data from a single page):
- We need to learn how to extract data. For that, we use activities such as Get element property or Get element value. Moreover, if we want to extract, say, four fields (title, link, price, rating), we'll have to use such an activity four times, specifying a different UI element each time. Since we want to extract these fields for all elements of the same type on the page, we will also need to run a loop iterating through each element. There is an alternative - the "Several elements" option in the parameters of the activities mentioned above, which extracts a specific property from a whole range of elements at once and reduces the number of actions required.
- We need to learn how to save the data. Once extracted, the data typically needs to go either to a variable (so that we can process it further) or to a file (for example, Excel, Google Sheets, or CSV). Since the extracted data is stored in a variable, we have to process that variable to save it to, say, Google Sheets.
None of this sounds trivial or intuitive. And if we want to apply it to multiple pages on a website, it becomes even more difficult (though not by much), since we have to run yet another loop. As a result, we've developed a separate instrument that significantly simplifies all of the actions listed above.
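For contrast, here is a minimal sketch of what that manual approach amounts to, written in Python with Selenium rather than ElectroNeek activities; the page URL and CSS selectors are hypothetical stand-ins:

```python
# A minimal sketch of the manual approach, assuming a hypothetical results
# page where each result is a ".result" element with a link and a price.
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

driver = webdriver.Chrome()
driver.get("https://example.com/search?q=history+books")  # hypothetical page

# One extraction per field, repeated in a loop for every element of the same type
rows = []
for item in driver.find_elements(By.CSS_SELECTOR, ".result"):  # hypothetical selector
    link = item.find_element(By.CSS_SELECTOR, "a")
    rows.append({
        "title": link.text,
        "link": link.get_attribute("href"),
        "price": item.find_element(By.CSS_SELECTOR, ".price").text,  # hypothetical selector
    })

# Saving the data is a separate, manual step
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link", "price"])
    writer.writeheader()
    writer.writerows(rows)

driver.quit()
```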
What is Web Scraper
Web Scraper is an instrument that allows you to extract data from a website without coding the algorithm that iterates over the elements on the page. All you need to do is specify which type of elements to extract data from.
Currently, this tool does not extract data from multiple pages on its own (you will still have to run a loop for that), but the data extraction part becomes much more intuitive and easier to build.
What Web Scraper consists of
Before demonstrating the instrument, let's take a look at what it includes:
- A new file type - .rel. Files of this type are created in Studio Pro and have a separate canvas with a different list of functions. The purpose of this file is to store the structure of the elements from which we need to extract data. You do not have to develop a whole algorithm, and you do not have to code - it is just the structure of the data.
- A new activity - Scrape structured data. This activity is located in the Web Automation - Browser section and accepts a path to a .rel file. It returns the extracted data and saves it to a table. If needed, the data can be saved to a variable instead (disconnect the table block and replace it with the Assign value to variable activity, saving the previous step's result).
- A new selection mode in the Desktop Picker tool. This mode is launched when you click the "Pick new element" button in the "Data element" function of a .rel file. It allows you to select a whole range of elements at a time.
The Web Scraper tool will keep improving and receiving new functions - watch our updates.
Example
Let's learn how to work with the tool through an example. Suppose we want to open Amazon, find best-selling American history books, and extract the links, titles, and prices from the search results.
In the picture above, purple rectangles mark the titles, and blue and green rectangles mark the prices. We didn't highlight the links because they correspond to the titles (the titles are clickable).
We will extract data from the first result page only and save it to an Excel table.
Step 1 - Creation of data structure
Setting up the environment
We do not need to emulate opening Amazon and navigating to the category. We will replace all these actions with a single link that already contains the search results. Manually open Google Chrome or Microsoft Edge and paste the link: https://www.amazon.com/gp/bestsellers/books/4808?ref_=Oct_d_obs_S&pd_rd_w=ySuss&content-id=amzn1.sym.9aa64228-a828-4242-9935-e693c0cc3357&pf_rd_p=9aa64228-a828-4242-9935-e693c0cc3357&pf_rd_r=GA95S5W71S3FMX2WBBRH&pd_rd_wg=UZms0&pd_rd_r=d7ea146d-3881-4190-8399-7f0ae11e6d37
Creation of element relations file
Let's start with the most interesting, time-consuming, and yet fairly simple step - preparing a file that will store the data structure (or the relations between elements; the two phrases mean roughly the same thing in this context). Open Studio Pro and click "File" - "Create element relations flowchart".
A separate canvas will open, with a "Start" block on it and a separate list of functions on the left panel. Currently, Web Scraper has only one function - "Data element".
Save the file; let's call it scrape_bestsellers.rel.
Building data structure
Drag and drop the "Data element" function onto the canvas and connect it to the "Start" block. If the block is marked with a red exclamation mark, it only means that some required parameters are not filled in yet. We will fill in all of them as we go.
Selection of elements
Let's select the elements from which we want to extract data. Click the "Pick new element" button to launch the selection mode. You will see that as you hover your mouse over different elements, a blue rectangle highlights them.
Hover your mouse so that the first title on the search page is highlighted with the blue rectangle and press CTRL + X. As you can see, several elements of the same type are selected along with it. Now check whether all the titles are selected. If not, hover the cursor over the unselected titles and press CTRL + X again - the number of selected elements should increase.
If you want to exclude some highlighted elements from the selection, hover the cursor over them and press CTRL + Y - the number of selected elements should decrease.
This way, by combining CTRL + X and CTRL + Y, you can highlight all the desired elements.
Once you have selected all the desired elements, press ESC. The Browser Picker window will appear, showing the number of elements found.
Previewing data
In the Browser Picker window there is a "Test" button that shows the selected elements, so you can make sure your selection is correct (see the screenshot below).
You can also see the "Preview Data" button. Click on it - you will see the extracted links.
In the drop-down list, you will see the attributes from which the data is read. Since we selected titles, which are essentially links, the default output attribute is href. However, we can add more attributes and read data from them as well. Click on the drop-down list and select another attribute - innerText. This attribute will be used to extract the titles themselves (the text). After you add this attribute, another column will appear. You may not see it right away because of the width of the first column, so scroll to the right. You may also use the vertical scroll bar to view the whole range of data. This is how you test your selection.
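To make the difference between the two attributes concrete, here is a hypothetical illustration using Selenium as a stand-in for the preview: for a link element, href is the URL it points to, and innerText is the visible text.

```python
# Hypothetical illustration of the two attributes for one link element.
# For a link such as <a href="https://example.com/book">Some Book Title</a>,
# href is the URL and innerText is the visible label.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")          # hypothetical page with book links

link = driver.find_element(By.TAG_NAME, "a")
print(link.get_attribute("href"))          # the URL the title points to
print(link.get_attribute("innerText"))     # the visible title text

driver.quit()
```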
You can also rename the element to something more intuitive, for example, "Book titles". Click "Save" to return to the canvas.
Finishing up the function
In the "Data element" block, there are only two parameters left - "Properties to extract" and "Output name". We will use the same attributes in the "Properties to extract" parameter that we used when previewing the data - href
and innerText
.
"Output name" is how the corresponding column in the resulting table will be named. Let's call it title
. Then in the resulting table, we will have two columns: title_href
and title_innerText
.
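As a hypothetical one-liner illustrating that naming rule (output name, underscore, attribute):

```python
# Hypothetical illustration of the column-naming rule described above:
# each column is named <output name>_<attribute>.
output_name = "title"
attributes = ["href", "innerText"]
print([f"{output_name}_{attr}" for attr in attributes])
# -> ['title_href', 'title_innerText']
```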
Adding more functions
Now let's set up the canvas so that the prices are extracted as well. Drag and drop another "Data element" function and connect it to the "Start" block. We will move the "Start" block so that it sits between the two functions, as this better reflects the hierarchy and relations: there is the start node (the Amazon page itself) and two element types from which we want to extract data.
Let's set up the second "Data element" block the same way we set up the first one, only this time selecting the prices.
The selection mode has one more useful feature: if you press CTRL + H, elements from previous selections (from other "Data element" functions on the canvas) are shown or hidden. This helps you keep track of what you are selecting.
Press ESC to save your selection of prices. You can also test it again.
We will not change the extracted property this time, but we will use the "Include data from other selections to the preview" checkbox. With this option enabled, you will see the two columns from the other "Data element" block on the canvas, so you can check whether the data actually matches up. Click "Save" to exit.
Let's finish setting up the block. We will specify innerText as the property to extract and price as the output name (see the screenshot below).
We've prepared the data structure of Amazon and are ready to test it.
Testing structure
Before creating a .neek file, we can run a final test to see how the structure extracts data from the page. Click on the "Start" block and paste the link from above into it. The "Test scraping" button will become active.
Make sure you use the "Chrome Native" method of interaction with the browser ("Settings" - "Interaction with browser"), and leave the Amazon page open in Chrome. Click the "Test scraping" button - the bot will open a new tab and scrape the data from there. The result is saved to the data_preview variable on the "Variables" tab.
Save changes to the .rel file.
Step 2 - Setting up a bot
Create a new .neek file and save it. Drag and drop the Open URL activity and specify the link to the page in the URL parameter. Then drag and drop the Scrape structured data activity from the "Web Automation" - "Browser" section. Since we already have a file with the data structure, we do not need to click the "Create new data relation file" button. Click "Pick" and select the previously saved .rel file.
Click on the Save Table activity that follows the Scrape structured data activity. By default, it is set up so that the resulting Excel file is placed in the same folder as the bot, but you may modify this if needed - for example, to save the result to a Google Sheets table. We will leave it as it is.
Save changes to the .neek file.
Step 3 - Launch
Since we haven't added any activities that open a browser instance, and we use the "Chrome Native" method (see above), make sure that Google Chrome is running, that the Amazon page is open in it, and that no other tab has the exact same content.
Launch the bot. It will scrape the data almost instantly and save it to a table. Navigate to the folder with the bot, open the table, and check the result.
How to extract data from multiple pages
Suppose we want to extract data not only from the first page (or row of data) but from the first two (or three, four, or all of them). Let's focus on one category only, say, "Best Sellers". In the row with the books, you will notice a right-arrow button on the right-hand side. We will make use of this button.
All we need to do is create a loop:
The loop is simple:
- We create a counter variable, counter = 0, that will count the number of iterations.
- We create a condition for the loop in the If...then activity: counter < 2 (since we want to make two iterations).
- In the loop body, the bot scrapes the data from the current row of data (or page, in another scenario).
- In the "Save table" activity, we need to adjust the path. Select the "Calculate a value" option and set it to "data_" + counter + ".xlsx". This way, a new file is generated on each iteration.
- Then we click on the "Next" button.
- The bot goes on to the next iteration.
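To see the control flow at a glance, here is the same loop as a plain-Python sketch; the helper functions are hypothetical stand-ins for the ElectroNeek activities, not a real API:

```python
# A plain-Python sketch of the loop. The three helpers below are
# hypothetical stand-ins for the "Scrape structured data" activity,
# the "Save table" activity, and the click on the "Next" button.
def scrape_current_page():
    """Stand-in for the Scrape structured data activity."""
    return [{"title_href": "...", "title_innerText": "...", "price_innerText": "..."}]

def save_table(rows, path):
    """Stand-in for the Save table activity."""
    print(f"would save {len(rows)} rows to {path}")

def click_next_button():
    """Stand-in for clicking the 'Next' arrow."""
    pass

counter = 0                       # counter variable, starts at 0
while counter < 2:                # the If...then condition: counter < 2
    rows = scrape_current_page()  # scrape the current row of data
    save_table(rows, "data_" + str(counter) + ".xlsx")  # one file per iteration
    click_next_button()           # move on to the next row/page
    counter += 1
```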
A similar approach works on other websites whenever you want to extract structured data from multiple pages or sets of data.
How to edit existing data structure
To open and edit an already existing data structure in Studio Pro, do the following:
- Click "File" - "Open".
- Change the filter to the .rel extension.
- Search for the desired file and open it.