iHash

News and How to's

Building an E-commerce Scraper – insideBIGDATA

Mar 31, 2022 by iHash

Sponsored Post

Web scraping has changed considerably since it first entered public consciousness. As automated data collection has become more accessible, its customer base has grown, and the tools have had to change to accommodate the influx of new users.

We sit down with Aleksandras Šulženko, Product Owner at Oxylabs.io, to discuss the past, present, and future of web scraping. He has been involved with both the technical and business side of automated collection for several years and has helped create some of the leading web scraping solutions on the market such as the E-commerce Scraper API, dedicated to collecting external publicly available data from e-commerce websites.

Over the years, how has your understanding and perception of the process changed? How has the industry evolved since you began working with web scraping?

I began my work when scraping was still niche. Today, if you have an online presence, you’ll be somehow involved with scraping. You might be indirectly involved by using databases that have been compiled with web scraping, but you’re still involved.

Another important, somewhat related change is the proliferation of online businesses. Back in the day, monitoring 50 pages was enough to cover the necessary ground. Now you’d need hundreds, if not thousands, of domains to get started.

What also comes to mind is the technical side of scraping – it has become significantly harder. Anti-bot solutions, especially general ones, have risen in popularity. Proprietary solutions haven’t been that much of an issue, though.

Some of the solutions are fairly rudimentary. It’s popular, for example, in Brazil to block any IP addresses coming from Europe. Systems such as these are incredibly easy to implement. But if you want to track bots efficiently, things get complicated quickly.
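To illustrate how little such a rudimentary block involves, here is a minimal sketch. It assumes the server already has a list of IP ranges attributed to a region; the ranges below are reserved documentation networks used purely as placeholders, and `is_blocked` is a hypothetical helper, not part of any product mentioned here.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation networks), standing in for
# whatever ranges a real geolocation database would attribute to a region.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.44"))   # address inside a blocked range
    print(is_blocked("203.0.113.9"))  # address outside every blocked range
```

A check like this says nothing about whether the visitor is a bot, which is why it stays easy while real bot tracking does not.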

All you get, server-side, is traffic. How do you differentiate between bots and humans? Almost all methods used to track the former lead to probabilistic evaluations. No matter how good the system is, any bot-activity threshold will eventually catch a legitimate user.

Usually, where the threshold is set depends on the non-technical decision makers of the business. Bot protection is always a trade-off between anti-scraping and user experience, which is, in the end, a trade-off for revenue. And you know what happens when there’s danger to the bottom line.
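The trade-off can be made concrete with a toy example. The bot-likelihood scores, the labels, and the `evaluate_threshold` helper below are all made up for illustration; they are not data from any real detection system.

```python
from typing import List, Tuple

# Hypothetical (score, is_bot) pairs. A real system would derive scores
# from signals such as request rate or browser fingerprints.
SESSIONS: List[Tuple[float, bool]] = [
    (0.05, False), (0.20, False), (0.35, False), (0.55, False), (0.70, False),
    (0.40, True), (0.60, True), (0.75, True), (0.90, True), (0.95, True),
]


def evaluate_threshold(threshold: float) -> Tuple[int, int]:
    """Return (humans wrongly blocked, bots let through) at a threshold."""
    humans_blocked = sum(
        1 for score, is_bot in SESSIONS if not is_bot and score >= threshold
    )
    bots_missed = sum(
        1 for score, is_bot in SESSIONS if is_bot and score < threshold
    )
    return humans_blocked, bots_missed


if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.8):
        blocked, missed = evaluate_threshold(threshold)
        print(f"threshold={threshold}: humans blocked={blocked}, bots missed={missed}")
```

In this toy data the score ranges overlap, so no threshold blocks every bot without also blocking at least one human. Lowering it costs user experience; raising it lets more bots through.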

So, many businesses that have the technical capabilities of creating severe anti-bot protections don’t do it, because even a minor impact on UX would have a sizable impact on revenue. They’d rather only ban the really egregious examples of bots.

In the end, it’s a cat-and-mouse game with the deck stacked slightly in favor of the mouse. I’d rate scraping as easy to medium, while bot protection is hard.

When the Scraper APIs were still in their early stages, what were the first or most important upgrades you implemented? 

One of the greatest and most important updates has been the headless browser mode. It does cost more computing power overall than simple requests, but it lets you load JavaScript elements. Without them, lots of data wouldn’t even be displayed and would otherwise be impossible to reach.

There are other ways to get the same data, such as through APIs, but they’re often less scalable. Additionally, page URLs are much more stable references to specific places within a domain. It’s unlikely that the same URL would start pointing to a completely different page after some time.

In summary, a headless browser is more costly to run (when compared to sending requests), but it’s more scalable than the other options and provides access to data that otherwise wouldn’t be available.
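The gap that a headless browser fills can be shown without one. The sketch below parses a hypothetical page with Python's standard `html.parser`, the way a simple request-based scraper would see it; the page content and the `TextById` helper are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the price element is empty in the raw HTML and is
# only filled in when a browser executes the script.
RAW_HTML = """
<html><body>
  <h1 id="title">Example product</h1>
  <span id="price"></span>
  <script>
    document.getElementById("price").textContent = "19.99";
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text inside elements that carry an id attribute.

    Simplified on purpose: it assumes every start tag has a matching end
    tag, which holds for the sample page but not for HTML in general.
    """

    def __init__(self) -> None:
        super().__init__()
        self.text_by_id = {}
        self._open_ids = []  # id (or None) of each currently open element

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        self._open_ids.append(element_id)
        if element_id is not None:
            self.text_by_id[element_id] = ""

    def handle_endtag(self, tag):
        if self._open_ids:
            self._open_ids.pop()

    def handle_data(self, data):
        # Only keep text whose innermost enclosing element has an id.
        if self._open_ids and self._open_ids[-1] is not None:
            self.text_by_id[self._open_ids[-1]] += data.strip()


def parse_static(html: str) -> dict:
    """Parse raw HTML without executing any JavaScript."""
    parser = TextById()
    parser.feed(html)
    parser.close()
    return parser.text_by_id


if __name__ == "__main__":
    print(parse_static(RAW_HTML))
```

The static parse finds the title but leaves the price empty, because the value only exists after the script runs. A headless browser executes that script before the page is read, at the cost of the extra computing power mentioned above.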

Our development of the Universal scraper has been a turning point as well. Previously, like most teams starting out, I think, we only created dedicated “one-trick ponies” that worked on a single domain or, possibly, a set of pages.

Universal scraping is a big step mainly because you can always turn it into a dedicated one with some settings. In other words, if a universal scraper can acquire data from 90% of existing web pages, you can always tweak it to fit some specific use case. 

You can’t do it as easily the other way around – turn a dedicated scraper into a universal one. It’s not quantitatively but qualitatively different.
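One way to picture "universal plus settings equals dedicated" is a generic extraction function whose behaviour is driven entirely by configuration. This is only a sketch under simplifying assumptions: the function names, the rule format, and the use of regular expressions instead of a real HTML parser are illustrative choices, not a description of how any Oxylabs product works.

```python
import re
from functools import partial
from typing import Dict, Optional


def universal_scrape(html: str, rules: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply one regex per field; the first capture group is the value."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        result[field] = match.group(1).strip() if match else None
    return result


# Hypothetical settings describing one imaginary shop's page layout.
EXAMPLE_SHOP_RULES = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "price": r'<span class="price">(.*?)</span>',
}

# The "dedicated" scraper is the universal one with its settings fixed.
scrape_example_shop = partial(universal_scrape, rules=EXAMPLE_SHOP_RULES)


if __name__ == "__main__":
    page = '<h1>Blue mug</h1><span class="price"> 7.50 </span>'
    print(scrape_example_shop(page))
```

Going the other way is harder: a scraper with one site's layout hard-coded into its logic has no settings to swap out, so generalizing it means rewriting it.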

Would you start with dedicated Scraper APIs now over a more generic solution?

The decision on whether to start with a more generic or dedicated solution isn’t the call of the tech people. Business needs take priority – if you have a deal for a single website, there’s no reason to build a universal solution. But if you get a deal for hundreds, such a scraper might come in handy. It becomes extra valuable once you start thinking in the long-term.

We should also take into account that data isn’t distributed evenly across websites. There are certain center points where valuable information for some industries congregates. Scraping these tech giants might be more valuable than some accidental website of one small company.

If you had to completely start over from a basic e-commerce scraper prototype, what would you develop and upgrade first?

One thing I should note is that regardless of what approach you take, what your business needs are, or what resources are available, every solution has to be scalable. I wouldn’t even recommend starting if the solution isn’t, at some point, going to become scalable.

Web scraping is based on a complicated microservice architecture. You can’t predict what exactly you’ll need to scale ahead of time, but you’ll definitely need to do it. Needs, desires, and capabilities will change, necessitating horizontal scaling at the very least.
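As a single-process analogy for horizontal scaling, the sketch below keeps each worker stateless so that the number of workers is just a parameter. Threads stand in for separate service instances here, and `fake_fetch` and `scrape_all` are hypothetical names; a production setup would more likely spread workers across machines behind a queue.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_fetch(url: str) -> str:
    """Stand-in for a real network request so the sketch runs offline."""
    return f"<html>content of {url}</html>"


def scrape_all(
    urls: List[str],
    workers: int,
    fetch: Callable[[str], str] = fake_fetch,
) -> Dict[str, str]:
    """Fetch every URL using a pool of `workers` identical workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, urls))  # map preserves input order
    return dict(zip(urls, pages))


if __name__ == "__main__":
    targets = [f"https://example.com/product/{i}" for i in range(5)]
    # Same result with one worker or four; only throughput would differ.
    print(scrape_all(targets, workers=1) == scrape_all(targets, workers=4))
```

Because no worker holds state that another one needs, adding capacity means adding workers rather than redesigning the pipeline.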

There are two ways to talk about Scraper APIs if we’re thinking of them at a higher level. One is the technical side of things – the scrapers, the parsers, the domains, etc. The other is the interaction (or the API) itself – how it takes, processes, and handles requests by users. These can be thought of as two distinct but closely related domains.

The API is a little bit easier. We’ve been developing application programming interfaces for ages. There are plenty of best-practice manuals, development guides, and other sensible writings out there. I don’t think we should be reinventing the wheel here, outside of the service side of the API.

On the technical side, I’d like to make things less complicated. As web scraping is going from niche to something everyone does, we’d need fully visual tools with a variety of output formats. We had previously focused on JSON and HTML; however, many businesses need other, differently structured output formats.
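Supporting several output formats can be as simple as keeping the scraped records in one internal shape and adding a formatter per format. The sketch below uses only Python's standard library; the `export` function, the `FORMATTERS` table, and the sample record are assumptions made for illustration.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]


def to_json(records: List[Record]) -> str:
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records: List[Record]) -> str:
    """Serialize records as CSV, using the first record's keys as header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Adding a new output format means adding one entry here.
FORMATTERS: Dict[str, Callable[[List[Record]], str]] = {
    "json": to_json,
    "csv": to_csv,
}


def export(records: List[Record], output_format: str) -> str:
    """Render records in the requested format, or fail with a clear error."""
    try:
        formatter = FORMATTERS[output_format]
    except KeyError:
        raise ValueError(f"unsupported format: {output_format}") from None
    return formatter(records)


if __name__ == "__main__":
    sample = [{"title": "Blue mug", "price": "7.50"}]
    print(export(sample, "csv"))
    print(export(sample, "json"))
```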

Finally, we should be moving towards no-code formats to be more in tandem with the growing market. Researchers, entrepreneurs, and many other non-tech-savvy people are banging at the gates. We should be letting them in, instead of restricting access to web scraping through complicated technical layers.

Oxylabs is carrying some good news as Scraper APIs are now available for a free trial! Now you can test them out and see them in action. Fetch public data from any site you like during a one-week trial period.

  • SERP Scraper API – collect public data from major search engines;
  • E-Commerce Scraper API – get pricing, product, and other public data from the largest online marketplaces;
  • Web Scraper API – gather publicly available data from a majority of websites and build your own use cases.

Filed Under: BigData
