Best Web Scraping or Web Crawling Ethics to Follow

March 4, 2020

Many of us wonder which best practices to follow when undertaking a web scraping project. Although there have been no major legal hurdles to scraping publicly available data worth writing about (other than the one-off Ryanair case), it is advisable to follow a few steps that will keep you on the right side of the law.

1. Never swamp the targeted site to the extent of denying access to other legitimate users. You can avoid this by limiting your access to the site's non-peak hours: ramp up from the evening until dawn, on weekends and on public holidays. Some popular sites, such as Google, Yahoo, Amazon and Facebook, warn you if you access their content too fast. Treat that as a signal to slow your scraper down.

2. Never download the same content more than once, as you are just wasting the site's bandwidth. Try to download all the content to your local machine in one go and then do the processing there.

3. Try not to be the #1 user of the targeted site. If they ever get around to checking their log files, you do not want to be at the top of the list. You may use proxy IPs to conceal your activity so that you do not appear as the #1 user of the site.

4. Ask the client whether they have the necessary permission from the site owner to download the data. If the site owner finds value in sharing the data and gives permission, that is a huge plus for the scraping project.

5. If the targeted site requires you to create an account (paid or free) to access data, do not use aliases. Use actual information and inform the client upfront, or ask the client to provide access to the website.

6. If the site sends a warning email, respect it. Immediately stop the scraping, delete all the data and end the project. The client will understand.
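
As a rough illustration of tips 1 and 2, here is a minimal Python sketch of a fetcher that pauses between requests, slows down further when the site pushes back, and never downloads the same URL twice. The names `PoliteFetcher` and `http_download`, the `scrape_cache` directory and the delay values are all hypothetical. Treating HTTP 429 and 503 responses as the "slow down" warning is an assumption, since each site signals this in its own way.

```python
import hashlib
import time
import urllib.error
import urllib.request
from pathlib import Path


def http_download(url):
    """Return (status_code, body) for a URL using only the standard library."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


class PoliteFetcher:
    """Hypothetical helper: throttles requests and caches every page on disk."""

    def __init__(self, download=http_download, cache_dir="scrape_cache",
                 delay=2.0, max_delay=60.0, max_retries=3, sleep=time.sleep):
        self.download = download          # callable: url -> (status, body)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.delay = delay                # seconds to wait between requests
        self.max_delay = max_delay        # upper bound for the backoff
        self.max_retries = max_retries
        self.sleep = sleep
        self.requests_made = 0

    def _cache_path(self, url):
        # One file per URL, named after a hash of the URL.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".bin")

    def fetch(self, url):
        path = self._cache_path(url)
        if path.exists():
            # Tip 2: already downloaded once, so do not hit the site again.
            return path.read_bytes()

        for _ in range(self.max_retries + 1):
            if self.requests_made:
                # Tip 1: pause before every request except the very first.
                self.sleep(self.delay)
            self.requests_made += 1
            status, body = self.download(url)
            if status == 200:
                path.write_bytes(body)
                return body
            if status in (429, 503):
                # Assumed "slow down" signal: double the pause and retry.
                self.delay = min(self.delay * 2, self.max_delay)
                continue
            raise RuntimeError(f"unexpected status {status} for {url}")
        raise RuntimeError(f"gave up on {url} after repeated slow-down responses")
```

A scraper would create one `PoliteFetcher()` per target site and call `fetch(url)` for each page, then run all parsing against the local copies. Scheduling the run for the site's non-peak hours is left to whatever job scheduler you already use.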

I hope these tips are useful. Feel free to share your thoughts and experiences on best practices for web scraping projects.


ITSYS Solutions is a leading online data acquisition company providing multinational organizations with clean, structured, customizable data from the web. These datasets are fetched keeping in mind the business nuances of our partners due to our ability to identify future business opportunities, competitor analysis, customer retention & more…

© Copyright nasscom. All Rights Reserved.