Web scraping has changed considerably since it entered public consciousness. As automated data collection has become more accessible, its customer base has grown, and the industry has had to adapt to accommodate the influx of newcomers.
We sit down with Aleksandras Šulženko, Product Owner at Oxylabs.io, to discuss the past, present, and future of web scraping. He has been involved with both the technical and business side of automated collection for several years and has helped create some of the leading web scraping solutions on the market such as the E-commerce Scraper API, dedicated to collecting external publicly available data from e-commerce websites.
Over the years, how has your understanding and perception of the process changed? How has the industry evolved since you began working with web scraping?
I began my work when scraping was still niche. Today, if you have an online presence, you’ll be somehow involved with scraping. You might be indirectly involved by using databases that have been compiled with web scraping, but you’re still involved.
Another important, somewhat related change is the proliferation of online businesses. Back in the day, monitoring 50 pages was enough to cover the necessary ground. Now you’d need hundreds if not thousands of domains to get started.
What also comes to mind is the technical side of scraping – it has become significantly harder. Anti-bot solutions, especially general ones, have risen in popularity. Proprietary solutions haven’t been that much of an issue, though.
Some of the solutions are fairly rudimentary. It’s popular, for example, in Brazil to block any IP addresses coming from Europe. Systems such as these are incredibly easy to implement. But if you want to track bots efficiently, things get complicated quickly.
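To illustrate how little a rudimentary region block takes, here is a minimal sketch in Python. The names `BLOCKED_NETWORKS` and `is_blocked` are hypothetical, and the address ranges are documentation-only placeholder networks (RFC 5737), not real regional allocations. A production system would more likely consult a GeoIP database.

```python
import ipaddress

# Placeholder ranges standing in for "addresses from a blocked region".
# These are RFC 5737 documentation networks, not real regional allocations.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A check like this says nothing about whether the visitor is a bot, which is why it is easy to build and equally easy to get around.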
All you get, server-side, is traffic. How do you differentiate between bots and humans? Almost all methods used to track the former lead to probabilistic evaluations. No matter how good the system is, any bot-activity threshold set will eventually target a legitimate user.
Usually, where the threshold is set depends on the non-technical decision makers of the business. Bot protection is always a trade-off between anti-scraping and user experience, which is, in the end, a trade-off for revenue. And you know what happens when there’s danger to the bottom line.
So, many businesses that have the technical capabilities to create severe anti-bot protections don’t do it, because even a minor impact on UX would have a sizable impact on revenue. They’d rather only ban the really egregious examples of bots.
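The threshold trade-off can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Visitor` type, the `bot_score` values, and the sample data are made up, and a real server never has the `is_bot` ground truth. The point is only that when human and bot scores overlap, moving the threshold trades blocked humans against missed bots.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Visitor:
    bot_score: float  # hypothetical probability-like score in [0, 1]
    is_bot: bool      # ground truth, which a real server never sees


# Made-up sample in which human and bot scores overlap.
VISITORS = [
    Visitor(0.05, False), Visitor(0.20, False),
    Visitor(0.45, False), Visitor(0.70, False),
    Visitor(0.40, True), Visitor(0.65, True),
    Visitor(0.85, True), Visitor(0.95, True),
]


def evaluate(threshold: float) -> Tuple[int, int]:
    """Return (humans wrongly blocked, bots let through) at a threshold."""
    humans_blocked = sum(
        1 for v in VISITORS if not v.is_bot and v.bot_score >= threshold
    )
    bots_missed = sum(
        1 for v in VISITORS if v.is_bot and v.bot_score < threshold
    )
    return humans_blocked, bots_missed


if __name__ == "__main__":
    for threshold in (0.3, 0.6, 0.9):
        blocked, missed = evaluate(threshold)
        print(f"threshold={threshold}: humans blocked={blocked}, bots missed={missed}")
```

On this sample, a strict threshold of 0.3 catches every bot but blocks two humans, while a lenient 0.9 blocks no humans and misses three bots. Banning only the egregious cases corresponds to the lenient end.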
In the end, it’s a cat-and-mouse game with the deck stacked slightly in favor of the mouse. I’d rate scraping as easy-to-medium, while bot protection is hard.
When the Scraper APIs were still in their early stages, what were the first or most important upgrades you implemented?
There are other ways to do the same process, such as through APIs, but they’re often less scalable. Additionally, page URLs are much more static representations of specific places within a domain. It’s unlikely that the same URL would start pointing to a completely different page after some time.
In summary, a headless browser is more costly to run (when compared to sending requests), but it’s more scalable than the other options and provides access to data that otherwise wouldn’t be available.
Our development of the Universal scraper has been a turning point as well. Previously, like most teams starting out, I think, we only created dedicated “one-trick ponies” that worked on a single domain or, possibly, a set of pages.
Universal scraping is a big step mainly because you can always turn it into a dedicated one with some settings. In other words, if a universal scraper can acquire data from 90% of existing web pages, you can always tweak it to fit some specific use case.
You can’t do it as easily the other way around – turn a dedicated scraper into a universal one. It’s not quantitatively but qualitatively different.
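One way to picture a universal scraper being narrowed down with settings is the sketch below. It is not how any particular product works: the `scrape` function, the `field_patterns` setting, and the sample page are all hypothetical. The generic path returns the title and text of any page, and a per-site settings dictionary adds dedicated fields on top.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _TextCollector(HTMLParser):
    """Collects the page title and every other non-empty text node."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def scrape(html: str, field_patterns: Optional[Dict[str, str]] = None) -> dict:
    """Universal by default; field_patterns (hypothetical) makes it dedicated."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    result = {"title": collector.title, "text": text}
    # Each setting maps a field name to a regex with one capture group.
    for name, pattern in (field_patterns or {}).items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


PAGE = (
    "<html><head><title>Blue Mug</title></head>"
    "<body><h1>Blue Mug</h1><p>Price: $12.50</p></body></html>"
)
SHOP_SETTINGS = {"price": r"Price:\s*\$([0-9.]+)"}

if __name__ == "__main__":
    print(scrape(PAGE))                 # generic output
    print(scrape(PAGE, SHOP_SETTINGS))  # same scraper, dedicated field added
```

Going the other way would mean taking code written around one site's `price` pattern and making it handle arbitrary pages, which is a redesign and not a settings change.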
Would you start with dedicated Scraper APIs now over a more generic solution?
The decision on whether to start with a more generic or dedicated solution isn’t the call of the tech people. Business needs take priority – if you have a deal for a single website, there’s no reason to build a universal solution. But if you get a deal for hundreds, such a scraper might come in handy. It becomes extra valuable once you start thinking in the long-term.
We should also take into account that data isn’t distributed evenly across websites. There are certain center points where valuable information for some industries congregates. Scraping these tech giants might be more valuable than scraping some random website of one small company.
If you had to completely start over from a basic e-commerce scraper prototype, what would you develop and upgrade first?
One thing I should note is that regardless of what approach you take, what your business needs are, or what resources are available, every solution has to be scalable. I don’t even recommend starting with anything if the solution isn’t, at some point, going to become scalable.
Web scraping is based on a complicated microservice architecture. You can’t predict what exactly you’ll need to scale ahead of time, but you’ll definitely need to do it. Needs, desires, and capabilities will change, necessitating horizontal scaling at the very least.
There are two ways to talk about Scraper APIs if we’re thinking of them at a higher level. One is the technical side of things – the scrapers, the parsers, the domains, etc. The other is the interaction (or the API) itself – how it takes, processes, and handles requests by users. These can be thought of as two distinct but closely related domains.
The API is a little bit easier. We’ve been developing application programming interfaces for ages. There are plenty of best-practice manuals, development guides, and other sensible writings out there. I don’t think we should be reinventing the wheel here, outside of the service side of the API.
On the technical side, I’d like to make things less complicated. As web scraping is going from niche to something everyone does, we’d need fully visual tools with a variety of output formats. We had previously focused on JSON and HTML. However, many businesses need other, differently structured output formats.
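As a rough illustration of what supporting more output formats could look like, the sketch below serializes the same scraped records as JSON or CSV through one entry point. The `export` function, the `FORMATTERS` registry, and the sample records are hypothetical and only use the Python standard library. CSV is chosen here as one example of a differently structured format.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, object]

# Made-up sample records standing in for parsed scraping results.
RECORDS: List[Record] = [
    {"product": "Blue Mug", "price": 12.5},
    {"product": "Red Mug", "price": 11.0},
]


def to_json(records: List[Record]) -> str:
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records: List[Record]) -> str:
    """Serialize records as CSV, using the first record's keys as the header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Adding a format means registering one more function here.
FORMATTERS: Dict[str, Callable[[List[Record]], str]] = {
    "json": to_json,
    "csv": to_csv,
}


def export(records: List[Record], fmt: str) -> str:
    """Render records in the requested format, or raise for unknown formats."""
    try:
        formatter = FORMATTERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported output format: {fmt}") from None
    return formatter(records)


if __name__ == "__main__":
    print(export(RECORDS, "csv"))
    print(export(RECORDS, "json"))
```

Keeping the formatters behind a single registry means the scraping and parsing side never needs to know which format a given user asked for.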
Finally, we should be moving towards no-code formats to be more in tandem with the growing market. Researchers, entrepreneurs, and many other non-tech-savvy people are banging at the gates. We should be letting them in, instead of restricting access to web scraping through complicated technical layers.
Oxylabs is carrying some good news as Scraper APIs are now available for a free trial! Now you can test them out and see them in action. Fetch public data from any site you like during a one-week trial period.
- SERP Scraper API – collect public data from major search engines;
- E-Commerce Scraper API – get pricing, product, and other public data from the largest online marketplaces;
- Web Scraper API – gather publicly available data from a majority of websites and build your own use cases.