How can I scrape pages with dynamic content using node.js?

I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.

I use cheerio in Node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code returns an empty result because, when the page is first loaded, the <ul id="store_list" class="listMain"> is empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?

Answers:

Answer

Here you go, using PhantomJS so that the page's scripts run before you query it:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
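        // evaluate runs inside the page, so return the data instead of logging there
        page.evaluate(function() {
          var hrefs = [];
          $('.listMain > li').each(function () {
            hrefs.push($(this).find('a').attr('href'));
          });
          return hrefs;
        }, function(hrefs) {
          // back in Node: print each link, then shut PhantomJS down
          hrefs.forEach(function (href) { console.log(href); });
          ph.exit();
        });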
      });
    });
  });
});
Answer

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.

The project pages for both modules have examples, but here's how to do dynamic scraping:

var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray()
  .driver(phantom());

x('http://google.com', 'title')(function(err, str) {
  if (err) return console.error(err);
  console.log(str); // the page title
});
Answer

The easiest and most reliable solution is to use puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.

The only change needed is to raise the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.
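
Editing files inside node_modules is fragile, since a reinstall overwrites them. As a possible alternative, Puppeteer has per-page timeout setters. The sketch below assumes a Puppeteer version that provides both page.setDefaultNavigationTimeout and page.setDefaultTimeout. The helper name applyTimeouts is hypothetical, and 3000000 is simply the value from this answer.

```javascript
// Hypothetical helper: raise a page's timeouts through the public API
// instead of patching Browser.js, TimeoutSettings.js, or Launcher.js.
function applyTimeouts(page, timeoutMs) {
  // Navigation calls such as goto, reload, and waitForNavigation.
  page.setDefaultNavigationTimeout(timeoutMs);
  // Other waiting calls such as waitForSelector.
  page.setDefaultTimeout(timeoutMs);
  return page;
}

// Usage with a real browser (requires `npm install puppeteer`):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = applyTimeouts(await browser.newPage(), 3000000);
```

The setters only affect the page they are called on, so call the helper for each page you open.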

Answer

Check out GoogleChrome/puppeteer

Headless Chrome Node API

It makes scraping pretty trivial. The following example scrapes the headline at npmjs.com (assuming the .npm-expansions element is still there).

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('.npm-expansions').textContent
  });

  console.log(textContent); /* No Problem Mate */

  await browser.close();
})();

evaluate runs the given function inside the page itself, where the page's own scripts execute, so dynamically created elements can be inspected.
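
Applied to the page from the question, the same idea might look like the sketch below. It assumes the list items are appended within 30 seconds of loading and that each one contains an a element. collectStoreLinks and main are hypothetical names, and the 30000 ms wait is an arbitrary choice.

```javascript
// Collect the href of every link inside the dynamically filled list.
// `page` is expected to be a Puppeteer Page; the selector comes from the question.
async function collectStoreLinks(page, url, itemSelector = '.listMain > li') {
  await page.goto(url);
  // Wait until the page's own scripts have appended at least one item.
  await page.waitForSelector(itemSelector, { timeout: 30000 });
  // The callback runs in the browser; only the plain strings come back to Node.
  return page.$$eval(itemSelector + ' a', (anchors) =>
    anchors.map((a) => a.getAttribute('href'))
  );
}

async function main() {
  // Loaded lazily so the helper above can be reused without launching a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const links = await collectStoreLinks(
      page,
      'http://www.bdtong.co.kr/index.php?c_category=C02'
    );
    links.forEach((href) => console.log(href));
  } finally {
    // Always release the browser, even if the selector never appears.
    await browser.close();
  }
}
```

Call main().catch(console.error) to run it. If the list never appears, waitForSelector rejects after the timeout instead of quietly returning nothing, which makes a changed selector easier to spot.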
