I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I'm running a PHP script from the terminal to fetch some URLs and extract some data from them as JSON. This has been working very nicely so far.
Recently I wanted to expand the crawling to a specific site and encountered the following problem:
Unlike any other site so far, this one only echoes bare-bones markup server-side and instead relies on a single JS script to build the relevant markup on load.
Obviously my PHP script can't handle that (it doesn't execute the JS, so the site stays mostly blank from what I can tell), so I can't crawl the site, since the content has not been created yet.
I'm unsure how to proceed. Is it actually possible to make my current PHP script "compatible" with that site, or do I need to change gears and incorporate a browser, i.e. pick a completely different route?
I'm currently thinking I would need to create an HTML/JS page that opens the URL in an iframe, so that I could run a JS function manually via the console to extract the data. However, I'm hoping there is a more feasible way.
When I need to scrape a website I normally:
1 - Navigate the target website in a normal browser (Firefox, Chrome, etc.) while monitoring/logging any GET requests containing pertinent info via Developer Tools. Pay special attention to XHR requests, as they normally contain JSON-encoded data.
Here's a small video I've made exemplifying this:
You can mimic the request headers made previously (explained in the video) and use them in a curl request.
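As a rough illustration of replaying such a request outside the browser, here is a minimal Python sketch using only the standard library. The endpoint URL and the header values are placeholders (assumptions, not taken from any real site); you would copy the real ones from the XHR request you observed in the browser.

```python
import json
import urllib.request

# Hypothetical endpoint and headers: replace them with the values copied
# from the XHR request seen in the browser's Developer Tools.
API_URL = "https://example.com/api/items?page=1"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "application/json",
    "Referer": "https://example.com/",
    "X-Requested-With": "XMLHttpRequest",
}


def build_request(url, headers):
    """Create a GET request that carries the same headers the browser sent."""
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url, headers, timeout=10):
    """Send the request and decode the JSON body into Python objects."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


# Usage (performs a real network call, so it is left commented out):
# data = fetch_json(API_URL, HEADERS)
# print(data)
```

If the endpoint returns the data the page's JS would otherwise render, there is no need for a browser at all; the same idea carries over to PHP's curl functions.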
Firefox. You can also use PhantomJS, a headless browser. Latest versions of GeckoDriver (used by Selenium) also support headless browsing.
I'm aware the question is about PHP, but if the OP needs to use Selenium, Python is way more intuitive, I'd say. Based on that, here's a Selenium example in Python:
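    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()  # launches Firefox through GeckoDriver
    driver.get("http://www.python.org")
    assert "Python" in driver.title

    # The find_element_by_* helpers were removed in Selenium 4;
    # use find_element(By.NAME, ...) instead.
    elem = driver.find_element(By.NAME, "q")
    elem.clear()
    elem.send_keys("pycon")
    elem.send_keys(Keys.RETURN)  # submit the search form
    assert "No results found." not in driver.page_source
    driver.close()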
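For a crawler that runs from the terminal, a headless variant that returns the JS-rendered markup may be more practical. The following is only a sketch under some assumptions: Selenium 4 and GeckoDriver are installed, and the function names `render_page` and `extract_links`, the CSS selector, and the URL in the usage comment are all hypothetical. The link extraction merely stands in for whatever data you actually need.

```python
from html.parser import HTMLParser


def render_page(url, wait_selector, timeout=15):
    """Load `url` in headless Firefox and return the markup after JS has run."""
    # Selenium is imported lazily so the parsing helper below can be used
    # without Selenium being installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until the JS-generated element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even on errors


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in the given markup, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Usage (needs Firefox and GeckoDriver, so it is left commented out):
# html = render_page("https://example.com/", "#app a")
# print(extract_links(html))
```

The rendered `page_source` can also be written to a file and handed to the existing PHP parsing script, so only the fetching step has to change.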
I see two possible paths:
Simulate a browser, e.g. using Selenium. For example, this article discusses the exact challenge you mention and provides a solution using Selenium and Python.