How to perform unauthenticated Instagram web scraping in response to recent private API changes?

Months ago, Instagram began rendering their public API inoperable by removing most features and refusing to accept new applications for most permission scopes. Further changes made this week constrict developer options even more.

Many of us have turned to Instagram's private web API to implement the functionality we previously had. One standout project, ping/instagram_private_api, manages to rebuild most of the prior functionality. However, alongside the publicly announced changes this week, Instagram also made underlying changes to their private API, requiring magic variables, user agents, and MD5 hashing to make web scraping requests possible. This can be seen by following the recent releases on the previously linked git repository, and the exact changes needed to continue fetching data can be seen here.

These changes include:

  • Persisting the User Agent & CSRF token between requests.
  • Making an initial request to https://instagram.com/ to grab an rhx_gis magic key from the response body.
  • Setting the X-Instagram-GIS header, which is formed by magically concatenating the rhx_gis key and query variables before passing them through an MD5 hash.

Anything less than this will result in a 403 error. These changes have been implemented successfully in the above repository; however, my attempt in JS continues to fail. In the code below, I am attempting to fetch the first 9 posts from a user's timeline. The query parameters which determine this are:

  • query_hash of 42323d64886122307be10013ad2dcc44 (fetch media from the user's timeline).
  • variables.id of any user ID as a string (the user to fetch media from).
  • variables.first, the number of posts to fetch, as an integer.

Previously, this request could be made without any of the above changes by simply GETting from https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%225380311726%22%2C%22first%22%3A1%7D, as the URL was unprotected.
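
For reference, that URL is just the query hash plus the URL-encoded JSON of the variables. A minimal sketch of how it can be assembled (buildQueryUrl is a hypothetical helper name used only for illustration, not something taken from the repository):

```javascript
// Hypothetical helper: builds the GraphQL query URL from a query hash and a variables object.
function buildQueryUrl(queryHash, variables) {
    // The variables are sent as compact JSON, percent-encoded into the query string.
    const encoded = encodeURIComponent(JSON.stringify(variables));
    return `https://www.instagram.com/graphql/query/?query_hash=${queryHash}&variables=${encoded}`;
}

const url = buildQueryUrl('42323d64886122307be10013ad2dcc44', { id: '5380311726', first: 1 });
console.log(url);
```

With the values shown, this should reproduce the URL quoted above.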

However, my attempt at reimplementing the functionality that was successfully written in the above repository is not working, and I only receive 403 responses from Instagram. I'm using superagent as my requests library, in a Node environment.

const crypto = require('crypto');
const superagent = require('superagent');

/*
** Retrieve an arbitrary cookie value by a given key.
*/
const getCookieValueFromKey = function(key, cookies) {
    const cookie = cookies.find(c => c.indexOf(key) !== -1);
    if (!cookie) {
        throw new Error('No key found.');
    }
    return (RegExp(key + '=(.*?);', 'g').exec(cookie))[1];
};

/*
** Calculate the value of the X-Instagram-GIS header by md5 hashing together the rhx_gis variable and the query variables for the request.
*/
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

/*
** Begin (wrapped in an async function so that await can be used)
*/
(async () => {
    const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';

    // Make an initial request to get the rhx_gis string
    const initResponse = await superagent.get('https://www.instagram.com/');
    const rhxGis = (RegExp('"rhx_gis":"([a-f0-9]{32})"', 'g')).exec(initResponse.text)[1];

    const csrfTokenCookie = getCookieValueFromKey('csrftoken', initResponse.header['set-cookie']);

    const queryVariables = JSON.stringify({
        id: "123456789",
        first: 9
    });

    const signature = generateRequestSignature(rhxGis, queryVariables);

    const res = await superagent.get('https://www.instagram.com/graphql/query/')
        .query({
            query_hash: '42323d64886122307be10013ad2dcc44',
            variables: queryVariables
        })
        .set({
            'User-Agent': userAgent,
            'X-Instagram-GIS': signature,
            'Cookie': `rur=FRC;csrftoken=${csrfTokenCookie};ig_pr=1`
        });
})();

What else should I try? What makes my code fail, and the provided code in the repository above work just fine?

Update (2018-04-17)

For at least the 3rd time in a week, Instagram has again updated their API. With this change, the CSRF token is no longer required to form part of the hashed signature.

The question above has been updated to reflect this.

Update (2018-04-14)

Instagram has again updated their private graphql API. As far as anyone can figure out:

  • The User Agent no longer needs to be included in the X-Instagram-GIS MD5 calculation.

The question above has been updated to reflect this.

Answers:

Answer

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
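
One way to avoid forgetting a value is to build the shared headers in a single place and reuse them for every request. The following is only a sketch: buildSessionHeaders is a hypothetical helper, the user agent and token are placeholders, and it assumes that csrftoken is the only cookie that needs to be carried over.

```javascript
// Hypothetical helper: returns the headers that every request in a session should share.
// Request-specific headers (such as X-Instagram-GIS) are merged in through `extra`.
function buildSessionHeaders(userAgent, csrfToken, extra = {}) {
    return Object.assign({
        'User-Agent': userAgent,
        'Cookie': `csrftoken=${csrfToken}`
    }, extra);
}

// Placeholder values for illustration only.
const base = buildSessionHeaders('ExampleAgent/1.0', 'abc123');
const signed = buildSessionHeaders('ExampleAgent/1.0', 'abc123', { 'X-Instagram-GIS': 'deadbeef' });
```

Both objects can then be passed to superagent's .set(), so the initial request and the query request present the same User Agent.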

X-Instagram-GIS header generation

As your code shows, you must generate the X-Instagram-GIS header from two properties: the rhx_gis value, which is found in the response to your initial request, and the query variables of your next request. These must be MD5 hashed together, as shown in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};
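
Because the hash is computed over the serialized variables string, it seems safest to sign exactly the same string that is sent as the variables parameter. The sketch below repeats the calculation so that it is self-contained; the rhx_gis value is a placeholder, and the claim that the server hashes the string it receives is an assumption rather than documented behaviour.

```javascript
const crypto = require('crypto');

// Same calculation as generateRequestSignature above, repeated so the example runs on its own.
function sign(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest('hex');
}

const rhxGis = '0123456789abcdef0123456789abcdef'; // placeholder, not a real key
const compact = JSON.stringify({ id: '123456789', first: 9 });
const reordered = JSON.stringify({ first: 9, id: '123456789' });

// Identical input strings give identical signatures.
console.log(sign(rhxGis, compact) === sign(rhxGis, compact));
// Reordering the keys changes the string, and therefore the signature.
console.log(sign(rhxGis, compact) === sign(rhxGis, reordered));
```

In practice this means serializing the variables once and reusing that string for both the signature and the request.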

Answer

To call an Instagram query, you need to generate the x-instagram-gis header.

To generate this header, you need to calculate an MD5 hash of the following string: "{rhx_gis}:{path}". The rhx_gis value is stored in the source code of the Instagram page, in the window._sharedData global JS variable.

Example:
If you make a GET request for user info, such as https://www.instagram.com/{username}/?__a=1,
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")

This is tested and works 100%, so feel free to ask if something goes wrong.
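
A minimal Node sketch of the scheme described above. The helper names extractRhxGis and signPath are hypothetical, and the HTML fragment and key are made up for illustration; a real page would have to be fetched first.

```javascript
const crypto = require('crypto');

// Hypothetical helper: pulls rhx_gis out of the page HTML, or returns null if it is absent.
function extractRhxGis(html) {
    const match = /"rhx_gis":"([a-f0-9]{32})"/.exec(html);
    return match ? match[1] : null;
}

// Signature for a ?__a=1 request, following the "{rhx_gis}:{path}" scheme described above.
function signPath(rhxGis, username) {
    return crypto.createHash('md5').update(`${rhxGis}:/${username}/`, 'utf8').digest('hex');
}

// Made-up page fragment for illustration only.
const sampleHtml = '<script>window._sharedData = {"rhx_gis":"0123456789abcdef0123456789abcdef"};</script>';
const key = extractRhxGis(sampleHtml);
const header = signPath(key, 'instagram');
```

The resulting header value would be sent as x-instagram-gis with the matching /{username}/?__a=1 request.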

Answer

Uhm... I don't have Node installed on my machine, so I cannot verify this for sure, but it looks to me like you are missing a crucial part of the parameters in the query string, namely the after field:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 4,
    after: "YOUR_END_CURSOR"
});

Your MD5 hash depends on those queryVariables, so without the after field it doesn't match the expected one. Try that: I expect it to work.

EDIT:

Reading your code more carefully, it unfortunately doesn't make much sense. I infer that you are trying to fetch the full stream of pictures from a user's feed.

In that case, you should not call the Instagram home page as you are doing now (superagent.get('https://www.instagram.com/')), but rather the user's stream (superagent.get('https://www.instagram.com/your_user')).

Beware: you need to hardcode the very same user agent you're going to use below (and it doesn't look like you are...).

Then, you need to extract the query ID (it's not hardcoded, it changes every few hours, sometimes minutes; hardcoding it is foolish – however, for this POC, you can keep it hardcoded), and the end_cursor. For the end cursor I'd go for something like this:

const endCursor = (RegExp('end_cursor":"([^"]*)"', 'g')).exec(initResponse.text)[1];

Now you have everything you need to make the second request:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9,
    after: endCursor
});

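// generateRequestSignature takes only rhxGis and the query variables, as defined in the question.
// rurCookie and midCookie (used below) are assumed to be read from the initial response's cookies,
// in the same way as csrfTokenCookie.
const signature = generateRequestSignature(rhxGis, queryVariables);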

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'Accept': '*/*',
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close',
        'X-Instagram-GIS': signature,
        'Cookie': `rur=${rurCookie};csrftoken=${csrfTokenCookie};mid=${midCookie};ig_pr=1`
    }).send();
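
Note that the end_cursor regular expression above throws if the page contains no cursor, because exec returns null. A slightly more defensive sketch is shown below; extractEndCursor and buildVariables are hypothetical helper names, and the JSON fragment is made up.

```javascript
// Hypothetical helper: returns the first end_cursor found in the page, or null if there is none.
function extractEndCursor(html) {
    const match = /"end_cursor":"([^"]*)"/.exec(html);
    return match ? match[1] : null;
}

// Hypothetical helper: only adds "after" when a cursor is actually available.
function buildVariables(userId, first, endCursor) {
    const variables = { id: userId, first: first };
    if (endCursor) {
        variables.after = endCursor;
    }
    return JSON.stringify(variables);
}

const page = '{"page_info":{"has_next_page":true,"end_cursor":"AQBexample"}}'; // made-up fragment
const withCursor = buildVariables('123456789', 9, extractEndCursor(page));
const withoutCursor = buildVariables('123456789', 9, extractEndCursor('<html></html>'));
```

The returned string is what would be both signed and sent as the variables parameter.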

Answer

query_hash is not constant and keeps changing over time.

For example, the ProfilePage included these scripts:

https://www.instagram.com/static/bundles/base/ConsumerCommons.js/9e645e0f38c3.js
https://www.instagram.com/static/bundles/base/Consumer.js/1c9217689868.js

The hash is located in one of the above scripts, e.g. for edge_followed_by:

const res = await fetch(scriptUrl, { credentials: 'include' });
const rawBody = await res.text();
const body = rawBody.slice(0, rawBody.lastIndexOf('edge_followed_by'));
const hashes = body.match(/"\w{32}"/g);
// hashes[hashes.length - 2] is the hash for edge_followed_by
// hashes[hashes.length - 1] is the hash for edge_follow
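
The same idea can be written as a small function that works on an already downloaded script body. This is a sketch only: hashesBefore is a hypothetical name, the sample string is made up, and the assumption that the relevant hashes are the last ones before the marker comes from the snippet above and may stop holding whenever the bundles change.

```javascript
// Hypothetical helper: returns the 32-character hex strings that appear before the last
// occurrence of `marker` in a script body, in order of appearance.
function hashesBefore(scriptBody, marker) {
    const end = scriptBody.lastIndexOf(marker);
    if (end === -1) {
        return []; // marker not present in this script
    }
    const matches = scriptBody.slice(0, end).match(/"[a-f0-9]{32}"/g) || [];
    return matches.map(m => m.slice(1, -1)); // strip the surrounding quotes
}

// Made-up script fragment; the real bundles are much larger and their layout may change.
const sample = 'a="' + 'a'.repeat(32) + '",b="' + 'b'.repeat(32) + '";x.edge_followed_by';
const found = hashesBefore(sample, 'edge_followed_by');
```

Returning an empty array when the marker is missing makes it easy to try each bundle URL in turn.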
