Crawling a website with PHP, but the website runs JS to generate markup

I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I'm running a PHP script from the terminal to fetch some URLs and extract data from them as JSON. This has been working very well so far.

Recently I wanted to expand the crawling to a specific site and encountered the following problem:

Unlike any other site so far, this one only echoes bare-bones markup server-side and instead relies on a single JS script to build the relevant markup on load.

Obviously my PHP script can't handle that: it does not execute the JS, so the site stays mostly blank as far as I can tell, and I can't crawl it because the content has not been created yet.

I'm unsure how to proceed. Is it actually possible to make my current PHP script "compatible" with that site, or do I need to change gears and incorporate a browser, i.e. pick a completely different route?

I'm currently thinking I would need to create an HTML/JS page that opens the URL in an iframe, so that I could run a JS function manually via the console to extract the data. However, I'm hoping there is a more feasible way.

Thanks,

Answers:

Answer

When I need to scrape a website I normally:

1 - Navigate the target website in a normal browser (Firefox, Chrome, etc.) while monitoring/logging any POST/GET requests containing pertinent info via Developer Tools -> Network tab.
Pay special attention to XHR requests, as they normally contain JSON-encoded data.
Here's a short video I made demonstrating this:

https://www.youtube.com/watch?v=JbiZBGt8cos

You can mimic the request headers captured earlier (explained in the video) and use them in a cURL request, e.g.:

// Request headers copied from the browser's Network tab.
$headers = [
    "Connection: keep-alive",
    "Accept: application/json, text/javascript, */*; q=0.01",
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "DNT: 1",
    "Accept-Language: pt,en-US;q=0.9,en;q=0.8,pt-PT;q=0.7,pt-BR;q=0.6",
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://s1te.com/json_rand.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$server_output = curl_exec($ch);
if ($server_output === false) {
    // curl_exec() returns false on failure; report the reason.
    error_log("cURL error: " . curl_error($ch));
} else {
    print $server_output;
}
curl_close($ch);
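For readers who follow the Python route used in the Selenium example further down, the same idea can be sketched with the standard library only. This is a rough sketch, not a definitive implementation: the helper names (`build_request`, `decode_body`, `fetch_json`) are made up for illustration, and the header values are simply the ones from the cURL example above. Decoding the body with `json.loads` gives ordinary dicts and lists, so no HTML parsing is needed at all.

```python
import json
import urllib.request

# Headers copied from the browser's Network tab (same values as the cURL example).
BROWSER_HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
    ),
}


def build_request(url, headers=BROWSER_HEADERS):
    """Create a GET request that carries the copied browser headers."""
    return urllib.request.Request(url, headers=dict(headers))


def decode_body(raw_bytes, encoding="utf-8"):
    """Turn the raw bytes returned by a JSON endpoint into Python objects."""
    return json.loads(raw_bytes.decode(encoding))


def fetch_json(url, timeout=10):
    """Send the request and decode the JSON payload (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return decode_body(response.read())
```

Calling `fetch_json("http://s1te.com/json_rand.php")` would then return the decoded data, assuming the endpoint really answers with JSON.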

2 - In some cases it's impossible to crawl certain URLs without a JavaScript-enabled client. When this happens, I normally use Selenium with Chrome or Firefox. You can also use PhantomJS, a headless browser. Recent versions of Firefox (driven through GeckoDriver, which Selenium uses) also support headless browsing.


I'm aware the question is about PHP, but if the OP needs to use Selenium, I'd say Python is far more intuitive. With that in mind, here's a Selenium example in Python:

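from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()  # launches a local Firefox instance
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element_by_name() was removed in Selenium 4; use find_element(By.NAME, ...)
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)  # submit the search form
assert "No results found." not in driver.page_source
driver.quit()  # end the session and close the browser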

Example Src

Answer

I see two possible paths:

  • In case the JavaScript that builds up the DOM fetches the data through one or more AJAX calls, you might as well scrape from those URLs directly (and this tends to be easier anyway, e.g. if it talks to a JSON API).

  • Simulate a browser, e.g. using Selenium. For example, this article discusses the exact challenge you mention and provides a solution using Selenium and Python.
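Before choosing between the two paths, it can help to confirm that the content really is missing from the server-rendered markup. Here is a minimal sketch using only Python's standard library: it collects the text outside `<script>` and `<style>` tags and checks whether a string you can see in the browser is present. The names `VisibleTextParser`, `visible_text` and `needs_javascript` are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a client that does not run JavaScript would see."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_javascript(html, marker):
    """True when `marker` is missing from the server-rendered text."""
    return marker not in visible_text(html)
```

If `needs_javascript(raw_html, "some text visible in the browser")` returns True for the HTML your crawler downloaded, the content is being built client-side, and either the underlying AJAX endpoint or a real browser is needed.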
