Written by Julius Černiauskas, CEO at Oxylabs
Recent big data scandals have given web scraping a bad reputation. This article lists reasons that may convince you that web scraping is a force for good.
The Cambridge Analytica scandal, along with other data breaches, has given the data extraction industry a negative reputation. That’s a hard reality to face, because (a) I lead a company that provides ethically-sourced proxies for public data extraction, and (b) I believe that web scraping can be a force for good.
I realise that some people will need to be convinced that this is true because positive stories don’t get nearly as many clicks as negative ones. But they do exist, and I hope to change some minds with this article.
There’s no going back: big data is here to stay
A wise person in the business told me not long ago that the solution to “bad” or poorly-used technology is not to dispose of it. The solution is to upgrade how it works or improve standards for how it’s used.
This reminded me of issues that arose following the invention of the automobile, and how initial reckless driving and a lack of safety laws led to many injuries and deaths. The car wasn’t to blame; the problem was the way it was being driven. The solution, therefore, was to create better driving conditions and training that emphasised safety.
That example comes to mind when I think about the issues affecting our industry. We believe that many of the widespread concerns can be addressed through ethical tools that support web scraping, along with the use of best practices.
Web scraping helps pave the path to a better internet
There’s a lot of great work being done out there by technology firms, startups and independent developers looking to leverage web scraping in positive ways. This list is just a small sample of some of those projects:
Online marketplaces are an obvious example of the power of web scraping because almost all of us use these websites on a regular basis. And while I understand that cheap flights, bargain hotel rooms and rock-bottom gadget prices may not be saving the planet, these sites have made many of these products and services accessible to a wider audience. On top of that, new aggregator websites have brought to light ethical manufacturers that produce goods of high quality under fair working conditions.
“Watchdog” monitoring groups & journalists
Investigative journalists and “watchdog” monitoring groups use scrapers to source, track and compare information from public sites in their reporting. A notable (and controversial) example is the Reveal Project. This group compared member lists across several Facebook groups and found overlap between “extremist” organisations and law enforcement groups. Other examples include a Reuters investigation that uncovered an underground market for adopted children, and another that tracked elements of the online gun market.
Numerous sites tracking positive test results and/or deaths attributed to Coronavirus demonstrated the power of web scraping for collecting data on a global scale. Notable examples include partnerships with TrackCorona and CoronaMapper.
Along with online product marketplaces, the job market has benefited immensely from the power of web scraping. There are dozens of sites like CareerBuilder that list jobs from a wide variety of industries all over the world.
Tracking fake news
Journalistic integrity is a worldwide concern, no matter where an individual is located on the political spectrum. Besides disrupting the political process, fake news wields immense power whether it comes from a corporate or independent source.
Some startups are tackling the problem head-on by leveraging the power of web scraping along with machine learning algorithms to process large amounts of data from thousands of sources. The results, when analysed, provide insights into the credibility of the story according to its source and political slant.
Detecting illegal content
Illegal content is an increasing concern among business leaders and politicians. While there are many systems in place that track complaints, checking each individually and removing the content manually is impossibly time-consuming and inefficient.
Web scraping is fast-tracking efforts to combat this serious problem. A notable example is a project developed by the Oxylabs team in cooperation with the Communications Regulatory Authority of the Republic of Lithuania (RRT) that produced an AI-powered tool for detecting content depicting child abuse.
Besides being able to identify prohibited visual material, the solution automatically sends a notification to the RRT hotline where appropriate actions are taken to prosecute the offenders.
Ethical web scraping makes it possible
It is possible to engage in large-scale data collection while respecting the server infrastructure of public websites and the privacy of users. A crystal-clear code of ethics and a framework for web scraping customers should include:
• Scraping only publicly-available web pages.
• Ensuring that the data is requested at a fair rate and that it doesn’t compromise the web server.
• Respecting the data obtained and any privacy issues relevant to the source website.
• Studying the website’s legal documents, deciding if they will be accepted, and determining if the terms will be breached upon acceptance.
• Making use of proxies that are procured ethically.
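Two of the points above, sticking to pages a site permits crawlers to visit and requesting data at a fair rate, can be sketched in code. The following is a minimal Python illustration under stated assumptions, not a description of any particular company's tooling: the class name `PolitenessPolicy`, the user agent string, the sample robots.txt text and the delay value are all hypothetical. It relies only on the standard library's `urllib.robotparser` module, treats robots.txt as one reasonable signal of what a site owner allows, and leaves the actual HTTP request out.

```python
import time
import urllib.robotparser


class PolitenessPolicy:
    """Hypothetical helper: decides whether and when a URL may be requested."""

    def __init__(self, robots_txt, user_agent, min_delay_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Parse robots.txt text that the caller has already downloaded.
        self._robots = urllib.robotparser.RobotFileParser()
        self._robots.parse(robots_txt.splitlines())
        self._user_agent = user_agent
        self._min_delay = min_delay_seconds
        # The clock and sleep functions are injectable so the class
        # can be exercised without real waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def is_allowed(self, url):
        """Return True if the parsed robots.txt rules permit fetching this URL."""
        return self._robots.can_fetch(self._user_agent, url)

    def wait_for_turn(self):
        """Block until at least min_delay_seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self._min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Assumed sample rules; a real scraper would download the site's own robots.txt.
SAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

policy = PolitenessPolicy(SAMPLE_ROBOTS_TXT, "example-bot", min_delay_seconds=2.0)
public_ok = policy.is_allowed("https://example.com/public/page")
private_ok = policy.is_allowed("https://example.com/private/data")
```

A scraper built on this sketch would call `is_allowed(url)` before each request and `wait_for_turn()` immediately before sending it, skipping any URL the rules disallow. The remaining points in the list (privacy, terms of use, proxy sourcing) are matters of policy and review rather than something a few lines of code can enforce.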
A final word
Web scraping, like many things in life, can be used in positive and negative ways. Data breaches and associated scandals may make headlines, but that shouldn’t overshadow all the good work taking place in the world of data extraction. The benefits of ethical web scraping reach beyond our industry and extend into society. The solution is better technology, which will power the next evolution of web scraping and make all of this possible.
About the author
Julius Černiauskas is the CEO of Oxylabs, a global provider of premium proxies and data scraping solutions that helps businesses to realise their full potential by harnessing the power of data. Julius’ experience and understanding of the data collection industry have allowed him to implement a new company structure, taking product and service technology to the next level, as well as securing long-term partnerships with dozens of Fortune 500 companies. He regularly speaks on the topics of web scraping, big data, machine learning, technology trends, and business leadership.