Following redirects to find a final destination seems easy at first: point curl at the URL, let it follow the redirects, then spit out the final destination. That was my initial thought, and then came the “gotcha”: we also need to follow JavaScript redirects to find the “final-final” destination.
Enter Headless Chrome, a PhantomJS-like tool for “viewing” web pages without a visible browser window. The big benefit of Headless Chrome over PhantomJS is that it uses the latest version of the Blink rendering engine, as opposed to PhantomJS’ WebKit rendering engine, the very engine that Chrome ditched a while back. What this (hopefully) means for us is that when you load up a page in Headless Chrome and export it to a PDF, it should look the same as if you were to open it up in the Chrome browser. This seems like a perfect fit for our problem.
The test page
First things first, we need a test page. Or, more specifically, three test pages.
On page 1, we’ll have a simple PHP redirect:
<?php header( 'Location: page2.html' ); ?>
On page 2, we’ll do a JavaScript redirect:
<html>
<head>
    <title>Test Page 2</title>
    <script type="text/javascript">
        window.location = "page3.html";
    </script>
</head>
<body>
    <p>Hello world!</p>
</body>
</html>
And finally, on page 3, we’ll let ourselves know that we made it all the way to the end:
<html>
<head>
    <title>Test Page 3</title>
</head>
<body>
    <p>Final destination!</p>
</body>
</html>
And now, just to prove that this whole process is necessary, we’ll run curl with the -L flag to follow redirects and confirm that we’re only making it to the second page:
$ curl -L http://scripts.local.dev/page1.php
<html>
<head>
    <title>Test Page 2</title>
    <script type="text/javascript">
        window.location = "page3.html";
    </script>
</head>
<body>
    <p>Hello world!</p>
</body>
</html>
And if you test it in the browser, you’ll get all the way to the third page as expected.
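To see why curl stops at page 2, it helps to model what following redirects at the HTTP level actually involves. The sketch below is a simulation, not a real HTTP client: `followHttpRedirects`, `lookup`, and the `responses` table are hypothetical names standing in for the three test pages, and the 302 status is an assumption based on PHP’s default behaviour for a Location header.

```javascript
// Simulated responses for the three test pages (assumed values, not captured traffic).
const responses = {
  'http://scripts.local.dev/page1.php': { status: 302, location: 'page2.html' },
  'http://scripts.local.dev/page2.html': { status: 200, location: null },
  'http://scripts.local.dev/page3.html': { status: 200, location: null },
};

// Follow Location headers the way an HTTP-only client does.
// `lookup` is a hypothetical stand-in for making a real request.
function followHttpRedirects(startUrl, lookup, maxHops = 10) {
  const chain = [startUrl];
  let current = startUrl;
  for (let i = 0; i < maxHops; i++) {
    const res = lookup(current);
    const isRedirect = res.status >= 300 && res.status < 400 && res.location;
    if (!isRedirect) break;
    // Relative Location values resolve against the current URL.
    current = new URL(res.location, current).href;
    chain.push(current);
  }
  return chain;
}

const httpChain = followHttpRedirects(
  'http://scripts.local.dev/page1.php',
  (url) => responses[url]
);
console.log(httpChain);
```

The chain ends at page2.html because that page answers with a 200 and no Location header. The jump to page3.html only exists inside a script, which an HTTP-only client never executes.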
Headless Chrome
Now let’s test it with Headless Chrome and see what we get.
First you’ll need to find your Chrome browser installation location. I’m on OS X, so it’s at
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
Then we’ll need to set the headless flag, disable GPU acceleration, and then we’ll dump out the DOM:
--headless --disable-gpu --dump-dom http://scripts.local.dev/page1.php
So all together we’ll have
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --disable-gpu --dump-dom http://scripts.local.dev/page1.php
Which gives us this output
$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --disable-gpu --dump-dom http://scripts.local.dev/page1.php
<body>
    <p>Final destination!</p>
</body>
Perfect! Exactly what we were expecting. Now let’s see what we have to do to get those redirects.
The Script
Now, using a slightly modified script from the chrome-remote-interface repository we can get all the redirects including the JavaScript ones:
const CDP = require('chrome-remote-interface');

const options = {
    host: '127.0.0.1',
    port: 9222
};
const initialUrl = 'http://scripts.local.dev/page1.php';

// initialize an array to hold each redirected URL
var hops = [];
hops.push(initialUrl);

CDP(options, (client) => {
    // extract domains
    const {Page} = client;

    // close the connection once the page's load event fires
    Page.loadEventFired((timestamp) => {
        client.close();
    });

    // record and print every URL a frame navigates to
    Page.frameNavigated((frame) => {
        hops.push(frame.frame.url);
        console.log(frame.frame.url);
    });

    // enable events then start!
    Promise.all([
        Page.enable()
    ]).then(() => {
        return Page.navigate({url: initialUrl});
    }).catch((err) => {
        console.error(err);
        client.close();
    });
}).on('error', (err) => {
    // cannot connect to the remote endpoint
    console.error(err);
});
Be sure to install the dependencies via NPM
$ npm install chrome-remote-interface
And run the Headless Chrome instance in the background with
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --disable-gpu --remote-debugging-port=9222
Run the script, and you should see something like this
$ node redirect-test.js
http://scripts.local.dev/page2.html
http://scripts.local.dev/page3.html
And there we have our redirect chain, complete with JavaScript redirects. So now we have a script that we can plug into other parts of our system.
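One caveat worth knowing: Page.frameNavigated fires for every frame, including iframes embedded in a page, so on real sites the output can include ad and widget URLs that are not part of the redirect chain. Below is a minimal sketch of one way to keep only top-level navigations while collecting them into an array. `createHopRecorder` is a hypothetical helper, not part of chrome-remote-interface, and the events fed to it here are simulated. It assumes that child frames carry a `parentId` in the event payload while the top-level frame does not.

```javascript
// Hypothetical helper: collects top-level navigations into an array.
function createHopRecorder(initialUrl) {
  const hops = [initialUrl];
  return {
    hops,
    // Intended to be passed as the Page.frameNavigated handler.
    onFrameNavigated({ frame }) {
      // Child frames (iframes) have a parentId; skip them.
      if (frame.parentId) return;
      // Avoid recording the same URL twice in a row.
      if (hops[hops.length - 1] !== frame.url) hops.push(frame.url);
    },
  };
}

// Simulated events, shaped like Page.frameNavigated payloads.
// The ads.example URL is made up for illustration.
const recorder = createHopRecorder('http://scripts.local.dev/page1.php');
recorder.onFrameNavigated({
  frame: { id: 'A', url: 'http://scripts.local.dev/page2.html' },
});
recorder.onFrameNavigated({
  frame: { id: 'B', parentId: 'A', url: 'http://ads.example/frame.html' },
});
recorder.onFrameNavigated({
  frame: { id: 'A', url: 'http://scripts.local.dev/page3.html' },
});
console.log(recorder.hops);
```

In the script above, this could be wired up with Page.frameNavigated(recorder.onFrameNavigated). The handler closes over the array rather than relying on `this`, so it can be passed around as a plain function.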
In the source code there is “hops”. What does that object refer to?
Hey Sony,
“hops” is just an array that holds each redirect URL that we’re taken through, and that was mistakenly left out of the code. I’ve added it back in. Thanks for catching that!
Nice script. Do you know how I could also get the response code of each domain redirect and then take a screenshot of the last redirect?
For example, if red.com redirects to blue.com and then yellow.com, I would like to get these redirects saved to an array but also the response code (e.g. 302, 302, 200) for each URL too, either saved to a different array or preferably the same one.
Then I would like a screenshot of the last URL so in this case yellow.com.
Actually upon testing this script I realise it doesn’t work.
It only seems to get the last domain redirect URL, and also sometimes gets many other URLs which are not redirect URLs, such as doubleclick.net and social media widgets.
Try running the script with these examples.
Examples:
const initialUrl = 'http://google.com'; (only gets last redirect)
const initialUrl = 'http://yahoo.com'; (gets last redirect URL and also approx 20 others such as doubleclick.net and yahoo image and ad server URLs)
const initialUrl = 'http://madonna.edu'; (gets twitter/facebook URLs)
const initialUrl = 'http://youtube.com'; (only gets last redirect URL and also doubleclick.net and ad server URLs)