The good people at the National Library are conducting another Web Harvest. It is good to see that they have learnt from the last one and this time have a comprehensive policy for dealing with robots.txt rules. If you have a site that doesn’t fall under the .nz pre-selection criteria, don’t forget to nominate it.
New Zealand web harvest 2010: The National Library is conducting a whole of domain web harvest between 12 and 25 May.
Why does the National Library collect websites?
The National Library exists to preserve New Zealand’s social and cultural history, whether in the form of books, newspapers and photographs, or of websites, blogs and videos.
The New Zealand Web Harvest 2010 recognises the importance of the internet in all areas of New Zealand society and culture by taking a ‘snapshot’ of the New Zealand internet in May 2010.
Information for website owners
The harvest will run for approximately 14 days, from 12 to 25 May 2010.
The harvest will only collect publicly viewable web content. If your website, or parts of it, is password protected, this content will not be harvested.
We will harvest every domain in the .nz country code, and some others from .com, .net and .org. If you have a website outside .nz, you can ensure it is harvested and added to the Library’s collections by completing our Nomination form.
To submit a site map for harvesting, complete our Nomination form.
The web harvester will generally honour the robots.txt convention, with some exceptions. For example, if an image file is embedded in a web page, we will take a copy of that image file in order to have a complete copy of the web page.
If you set a robots.txt rule specifically for our harvester, NLNZHarvester2010, it will follow that rule strictly. However, we will always take a copy of a website’s homepage, regardless of the robots.txt rule.
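As a rough illustration of a harvester-specific rule, the sketch below uses Python's standard urllib.robotparser module to check which URLs a robots.txt file would permit for the NLNZHarvester2010 user agent. The robots.txt contents, the example.org.nz domain and the allowed() helper are all hypothetical, made up for this example. The standard parser only reports what the file says; it does not model the exceptions described above (the homepage and embedded images), so treat it as a way to sanity-check your rules rather than a prediction of exactly what will be collected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the harvester out of /private/ only,
# while blocking every other crawler from the whole site.
ROBOTS_TXT = """\
User-agent: NLNZHarvester2010
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url.

    This reflects only the robots.txt rules themselves, not any
    exceptions a particular crawler chooses to apply.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("NLNZHarvester2010", "http://example.org.nz/about.html"),
        ("NLNZHarvester2010", "http://example.org.nz/private/report.html"),
        ("SomeOtherBot", "http://example.org.nz/about.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

With this file, the harvester-specific block takes precedence over the catch-all block for NLNZHarvester2010, so only paths under /private/ are excluded for it.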
If you have comments or questions, please complete our Feedback form.
About the whole of domain web harvest
The National Library has commissioned the Internet Archive (an American-based not-for-profit) to perform the harvest on our behalf.
We will attempt to acquire:
- Websites that fall under the .nz country code
- Websites that fall under .com, .net and .org that can be programmatically determined to be hosted on machines that are physically located in New Zealand
- Selected websites based overseas that are covered by the provisions of the National Library of New Zealand Act (2003).
We estimate we will capture 130-140 million URLs, resulting in 7-8 terabytes of uncompressed data.
Keeping you informed
Notice of the harvest was first published on Thursday 8 April 2010, giving a five-week notification period.
The Library will keep website owners and other affected parties up to date throughout the harvest via this web page.
Regular progress updates will be posted on our LibraryTechNZ blog, and various mailing lists and forums will also be used to communicate with website owners.