PageFreezer

World’s cleverest website archiving tool, supported by technologies and brainpower from Redwerk

PageFreezer, Vancouver, Canada

PageFreezer.com is an industrial-strength, web-based service for managing, archiving, retaining, and replaying dynamic web content and social media.

Product Development

As a one-stop shop for software development, Redwerk implemented PageFreezer.com from the ground up. We went through every phase: requirements analysis, prototyping, architecture, UI/UX design, development, testing, deployment, maintenance, system administration, and support.

Data Mining

We build software that automatically processes websites and social network APIs, scrapes them at big-data scale, and renders the archived websites back to users.

Challenge

PageFreezer is the name of a technology start-up and also a web service which archives websites in a convenient and easy-to-use way, according to flexible schedules defined by the user. Any website, blog, or even Facebook and Twitter profiles, can be preserved for “future generations” in an interactive way, going much further than common screenshots.

This is a useful service for regulatory compliance, litigation protection, or marketing purposes. PageFreezer is an enterprise-class SaaS solution which supports even the most complex websites, and is convenient for individuals, small firms, as well as large corporations.

PageFreezer makes archiving the web easy, and enables you to re-live archived websites of the past as if they were hot off the press!

Redwerk was tasked with supporting the underlying technology, the IT “intelligence” behind this innovative web service. The goal was to build a SaaS application that would enable clients to permanently preserve their website and social media content in evidentiary quality, and then access those archives and replay them as if they were still live. It was fundamental that this solution support even the most complex websites, blogs, and Twitter or Facebook profiles, all on the same integrated platform. The application had to use web crawling technologies to capture websites automatically, as often as and whenever users wanted. The crawled content also had to be made searchable.

The main features included:

  • Automatic archiving
  • Public records compliance
  • Live replay/browsing of archives
  • Search for contents
  • Digital signatures
  • Data export
  • Data access through API

Solution

Website Crawling

For PageFreezer, we created a proprietary, highly advanced web crawler that accounts for the peculiarities of known web server and web browser software. It is a Java library that integrates with other projects and provides interfaces for overriding various behaviors.

To make monitoring the crawling processes as convenient as possible, we created an informative admin interface. The crawler captures images and Flash animations as well as text, even when they are hosted on different domains; an extra URL list was created for this purpose.

Include, exclude, and advanced website settings were introduced for users who wish to crawl only certain URLs depending on keywords. Flexible user agent selection for crawling was also added. The mechanism was designed to crawl web pages at times when they are not under high load. Clients can also use the crawl speed option to configure the number of crawl workers for each individual task, reducing the load on the website.
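As a rough illustration of keyword-based include and exclude settings, the Java sketch below decides whether a URL should be crawled. The class name `CrawlUrlFilter`, the method `shouldCrawl`, and the rule that exclude keywords take precedence are assumptions made for this example; they are not taken from the PageFreezer crawler.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical include/exclude URL filter; not PageFreezer's actual API. */
final class CrawlUrlFilter {
    private final List<String> includeKeywords;
    private final List<String> excludeKeywords;

    CrawlUrlFilter(List<String> includeKeywords, List<String> excludeKeywords) {
        this.includeKeywords = lowerCase(includeKeywords);
        this.excludeKeywords = lowerCase(excludeKeywords);
    }

    private static List<String> lowerCase(List<String> keywords) {
        return keywords.stream()
                .map(k -> k.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /**
     * Assumed policy: exclude keywords win, and an empty include list
     * means "crawl everything that is not excluded".
     */
    boolean shouldCrawl(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        if (excludeKeywords.stream().anyMatch(lower::contains)) {
            return false;
        }
        return includeKeywords.isEmpty()
                || includeKeywords.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        CrawlUrlFilter filter = new CrawlUrlFilter(List.of("/blog/"), List.of("/admin/"));
        System.out.println(filter.shouldCrawl("https://example.com/blog/post-1"));
        System.out.println(filter.shouldCrawl("https://example.com/blog/admin/login"));
        System.out.println(filter.shouldCrawl("https://example.com/about"));
    }
}
```

A plain substring match keeps the sketch short; a production filter would more likely support patterns and per-site configuration.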

Redwerk also implemented a standard sitemap XML crawling feature to reduce the time it takes to crawl large websites, because only modified pages and their contents are crawled and archived.
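One plausible way to crawl only modified pages is to compare each sitemap entry's `lastmod` value with the time of the previous crawl. The sketch below assumes that approach; the source does not say how PageFreezer detects changes, and the `SitemapDelta` class and its method names are hypothetical. It handles only the two most common `lastmod` forms (a plain date and a full timestamp with offset).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: lists sitemap URLs changed since the last crawl. */
final class SitemapDelta {

    static List<String> modifiedSince(String sitemapXml, Instant lastCrawl) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

            List<String> changed = new ArrayList<>();
            NodeList urls = doc.getElementsByTagNameNS("*", "url");
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                String loc = childText(url, "loc");
                String lastmod = childText(url, "lastmod");
                if (loc == null) {
                    continue;
                }
                // Assumption: entries without lastmod are re-crawled to be safe.
                if (lastmod == null || parseLastmod(lastmod).isAfter(lastCrawl)) {
                    changed.add(loc);
                }
            }
            return changed;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }

    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Accepts "2024-03-01T12:00:00+00:00" or "2024-03-01" (treated as UTC midnight). */
    static Instant parseLastmod(String value) {
        try {
            return OffsetDateTime.parse(value).toInstant();
        } catch (DateTimeParseException e) {
            return LocalDate.parse(value).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com/old</loc><lastmod>2024-01-01</lastmod></url>"
                + "<url><loc>https://example.com/new</loc><lastmod>2024-03-01</lastmod></url>"
                + "</urlset>";
        System.out.println(modifiedSince(xml, Instant.parse("2024-02-01T00:00:00Z")));
    }
}
```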

A number of outstanding, technologically advanced crawling options were also made available:

  • parsing links out of XML files using XSLT templates
  • generic authentication mechanism allowing crawlers to authenticate on almost any website

All of these features make PageFreezer a much more technologically advanced solution compared to the competition.
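To show what parsing links out of XML files with an XSLT template can look like, here is a minimal sketch using the standard `javax.xml.transform` API. The stylesheet, the `//item/link` selection (suited to an RSS-like feed), and the `XsltLinkExtractor` class are illustrative assumptions, not PageFreezer's templates.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical link extractor driven by an XSLT template. */
final class XsltLinkExtractor {

    // Illustrative template: writes the text of every <item>/<link> on its own line.
    private static final String TEMPLATE =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:for-each select=\"//item/link\">"
            + "<xsl:value-of select=\"normalize-space(.)\"/>"
            + "<xsl:text>&#10;</xsl:text>"
            + "</xsl:for-each>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static List<String> extractLinks(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TEMPLATE)));
            StringWriter output = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(output));

            List<String> links = new ArrayList<>();
            for (String line : output.toString().split("\n")) {
                if (!line.isBlank()) {
                    links.add(line.trim());
                }
            }
            return links;
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not apply link template", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss><channel>"
                + "<item><link>https://example.com/1</link></item>"
                + "<item><link>https://example.com/2</link></item>"
                + "</channel></rss>";
        extractLinks(feed).forEach(System.out::println);
    }
}
```

Keeping the selection logic in a template means a new XML format can be supported by swapping the stylesheet rather than changing Java code.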

Website Playback

One of the main goals, and one of the most impressive usage scenarios, was that users should be able to browse copies of websites as if they were still live. This was perhaps the key challenge, and it required a lot of complex thinking and innovative approaches. Playback is based on hyperlink resolution and on-the-fly substitution, JavaScript and redirect interception, and much more.
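The idea behind hyperlink resolution and on-the-fly substitution can be sketched as follows: each link in an archived page is resolved against the page's original URL and then pointed at the archive instead of the live site. The `/replay/{snapshotId}/{originalUrl}` layout, the `ReplayLinkRewriter` class, and the regex-based matching are assumptions chosen to keep the example short; a real implementation would need an HTML parser and would also have to handle JavaScript and redirects.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rewriter that points page links at archived copies. */
final class ReplayLinkRewriter {

    // Simplification: only double-quoted href/src attributes are handled.
    private static final Pattern LINK_ATTRIBUTE =
            Pattern.compile("(href|src)=\"([^\"]*)\"");

    /**
     * Resolves every link against the original page URL, then prefixes it
     * with an assumed replay path of the form /replay/{snapshotId}/.
     * URI.resolve throws IllegalArgumentException for malformed links.
     */
    static String rewrite(String html, URI pageUrl, String snapshotId) {
        Matcher matcher = LINK_ATTRIBUTE.matcher(html);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String absolute = pageUrl.resolve(matcher.group(2)).toString();
            String rewritten = matcher.group(1)
                    + "=\"/replay/" + snapshotId + "/" + absolute + "\"";
            matcher.appendReplacement(out, Matcher.quoteReplacement(rewritten));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a><img src=\"img/logo.png\">";
        URI page = URI.create("https://example.com/news/index.html");
        System.out.println(rewrite(html, page, "20120101"));
    }
}
```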

In order to get to your desired point in time, a convenient calendar was created, highlighting the dates on which the snapshots were taken. In order to allow the user to see the site structure we created a simple navigation tree which reflects the URL hierarchy. All the tree nodes are clickable and open the corresponding site page.
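A navigation tree that reflects the URL hierarchy can be derived by splitting each archived URL's path into segments. The sketch below is one minimal way to do that; the `UrlTree` class and its text rendering are hypothetical and stand in for whatever the real interface uses.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tree node keyed by URL path segment. */
final class UrlTree {
    final String name;
    // TreeMap keeps siblings in alphabetical order for stable display.
    final Map<String, UrlTree> children = new TreeMap<>();

    UrlTree(String name) {
        this.name = name;
    }

    /** Builds one tree from the paths of all given absolute URLs. */
    static UrlTree build(List<String> urls) {
        UrlTree root = new UrlTree("/");
        for (String url : urls) {
            String path = URI.create(url).getPath();
            if (path == null) {
                continue; // opaque URIs such as mailto: have no path
            }
            UrlTree node = root;
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    node = node.children.computeIfAbsent(segment, UrlTree::new);
                }
            }
        }
        return root;
    }

    /** Renders the tree as indented text, two spaces per level. */
    String render() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    private void render(StringBuilder out, int depth) {
        out.append("  ".repeat(depth)).append(name).append('\n');
        for (UrlTree child : children.values()) {
            child.render(out, depth + 1);
        }
    }

    public static void main(String[] args) {
        UrlTree tree = build(List.of(
                "https://example.com/",
                "https://example.com/blog/first-post",
                "https://example.com/blog/second-post",
                "https://example.com/about"));
        System.out.print(tree.render());
    }
}
```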

Social Media

Crawling social media profiles was a much harder challenge, as different rules apply to them than to conventional websites. PageFreezer’s link extraction was initially built with regular expressions and content parsers, but most pages on Twitter, Facebook, and other social networks are built dynamically with JavaScript. As the networks were all different, building the framework and extending it to additional social networks was laborious. The whole solution was unreliable at this stage, and every future change to these social networks would have had to be mirrored in the system, too. In the end, it was decided to develop a social network adapter based on third-party social network client libraries in Java. Spring Social was identified as meeting our requirements.

Data Storage

One of the most difficult tasks in this project was selecting the best storage option, which had to be very scalable. The project started with approximately 500 sites but had to be prepared for many more. We toyed with the idea of using S3 or Google for some time, but those proved too slow to access and too expensive. So Redwerk had to come up with a more flexible, custom-tailored idea, and after some benchmarking we built a simple yet scalable custom storage cloud from scratch, based on a database and an NFS file system.

Data Integrity

It was essential, as always, to ensure that no information was lost if any part of the system failed. We implemented logic that makes crawlers stop and wait whenever the database or the file system is unavailable. When these components come back, no information gathered by the crawlers is lost, and the use of checksums helps maintain the integrity of all stored data.
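The stop-and-wait behaviour and the checksum idea can be illustrated with a small sketch. It assumes a simple polling loop and a CRC32 checksum; the source does not specify the waiting strategy or the checksum algorithm, and `StorageGate`, `awaitAvailable`, and `checksum` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;
import java.util.zip.CRC32;

/** Hypothetical helper that pauses a crawler until storage is reachable. */
final class StorageGate {

    /**
     * Blocks until the availability check succeeds, polling at a fixed
     * interval. Returns the number of checks that were needed.
     */
    static int awaitAvailable(BooleanSupplier isAvailable, long pollMillis) {
        int checks = 1;
        while (!isAvailable.getAsBoolean()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for storage", e);
            }
            checks++;
        }
        return checks;
    }

    /** CRC32 is an assumed choice; it detects accidental corruption only. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Simulate storage that comes back roughly 50 ms from now.
        long backAt = System.currentTimeMillis() + 50;
        int checks = awaitAvailable(() -> System.currentTimeMillis() >= backAt, 10);
        System.out.println("storage available after " + checks + " checks");

        byte[] page = "<html>archived page</html>".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(page);
        // Later, after reading the page back, the same checksum should be reproduced.
        System.out.println("intact: " + (stored == checksum(page)));
    }
}
```

Because the crawler holds its captured data while it waits, nothing has to be discarded when the database or file system is temporarily down.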

Digital Signatures

A digital signature is a set of algorithms and other methods for validating digital documents or messages. Digital signatures are used in almost all sectors of the economy to detect forgery or tampering, which makes them a fundamental security tool.

The PageFreezer service is no exception. Here, Redwerk opted for a timestamping authority (TSA), which PageFreezer uses to digitally sign all crawled content. Hash data of the crawled content, verified certificates, user keys, and timestamps are all used when signing through the TSA. A valid TSA signature therefore gives PageFreezer clients reason to believe that the original web page was crawled at a particular moment in time. Thanks to this implementation, PageFreezer data can even be used as evidence in court.

Once the system is enabled, all snapshots available to the user will be signed through TSA, and the signature can be verified on the browsing page at any time.
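The hashing step that precedes signing can be sketched with the standard `MessageDigest` API. This example only computes and re-checks a content hash; it does not contact a TSA or build a signature, which would require additional libraries. SHA-256 and the `SnapshotImprint` class are assumptions, since the source does not name the hash algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that hashes crawled content before it is signed. */
final class SnapshotImprint {

    /** Returns the SHA-256 digest of the content as lowercase hex. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for Java platform implementations.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Checks that stored content still matches the hash recorded at crawl time. */
    static boolean matches(byte[] content, String recordedHex) {
        return MessageDigest.isEqual(
                sha256Hex(content).getBytes(StandardCharsets.US_ASCII),
                recordedHex.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        byte[] page = "<html>archived page</html>".getBytes(StandardCharsets.UTF_8);
        String recorded = sha256Hex(page);
        System.out.println(recorded);
        System.out.println(matches(page, recorded));
    }
}
```

Any change to the archived bytes produces a different hash, so a timestamped signature over the hash ties the exact content to the moment it was signed.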

Security

To protect data from destructive forces and from the unwanted actions of unauthorized users, we use a rock-solid combination of firewalls, fail2ban, backups, and slave database servers. Generally speaking, the system was created to be as modular and scalable as possible. The components do not affect each other's performance. Crawlers run as separate processes, and different modules were designed for logged-in users and guests.

  • 10 developers on the dedicated team
  • 4 QA engineers on the team
  • 7-year engagement
  • 2,631,855 lines of code

Results

This was the kind of challenging software outsourcing that Redwerk is renowned for. The solution was successfully prototyped and built, and it has undergone a couple of redesigns over the last few years to make sure it stays state-of-the-art.

Redwerk has been adding new functionality to meet new demands from PageFreezer’s customers. Our software developers handle all maintenance of the system, including administrative tasks such as upgrades and backups of the database and the archived content. Today, PageFreezer is the leading solution for flexible online content archiving needs, and we are proud to say Redwerk’s technology and know-how have contributed to its success!

Awarded

Red Herring Top 100 Global Finalist

I've been working with Redwerk almost continuously since 2006 on various complex software development projects (C++, Java, JSP, Spring, Django, iPhone). This company provides excellent software application development services for a great price. They are very flexible, customer-focused, responsive and communicative. I would warmly recommend other companies to hire them for your software development projects.
Michiel Riedijk, CEO at PageFreezer.com

It is a legal level archiving. The sort of archiving that has to be done to be in compliance for public companies and governments so that they can prove that their website said and did exactly what they’re saying it did at any one particular time.
Steve Dotto, TV show host at Dotto Tech
