Scraping LinkedIn with JavaScript
May 02, 2020
I know what your question will be: why do you want to scrape websites using JavaScript? Because, in software engineering, it's not about which language you choose; it's about how you use logic to solve problems under given constraints.
Since I am learning JavaScript, I decided to put a constraint on myself and scrape a website to solidify whatever I have learnt this past week, and the website I chose to scrape is LinkedIn.
Goal
The goal of this tutorial is to scrape LinkedIn for the top 5 profiles it shows when the user provides a search string, and to save each individual profile page as a .txt file in your local file system.
What you will learn
- How to perform, through code, actions inside the browser that we usually do with a mouse.
- How to automate form submission.
- How to save files to our local file system.
Prerequisites
This tutorial aims to be beginner friendly. If you are following along, you will need NodeJS installed on your PC.
The source code will be available in the repository below:
Github repo
Dependencies
- Puppeteer
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium. In other words, Puppeteer gives us functions through which we can run Chrome or Chromium and emulate all the actions we do in a browser, like going to a link or clicking a button, through code rather than a GUI.
- Cheerio
Cheerio implements a subset of core jQuery and is basically used for DOM manipulation.
- File-System
It is a Node.js module that allows you to work with the file system of your computer.
Project Setup
- Create a folder.
- Navigate to the folder from your terminal. Make sure you have NodeJS installed.
- In your terminal, run
`npm init`. This will create a `package.json` file inside your folder.
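After `npm init` finishes, the generated `package.json` will look roughly like this; the name, description, and version below are examples, and the exact values depend on your answers to the prompts:

```json
{
  "name": "linkedin-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "scrape.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "license": "ISC"
}
```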
Installing the dependencies
As we have three dependencies, we need to install them. Run the following commands in the terminal.
npm install puppeteer
npm install cheerio
npm install file-system
Breaking down the problem
Now that we are ready with all our dependencies, let us start by breaking down the project so that we can handle each individual problem and tackle them one by one. At this point, the apparent problems are:
- We need to first login to LinkedIn. Therefore, we would need to automate form submission.
- Then we would need to figure out a way to provide the search string, so that LinkedIn gives us the top 5 profiles to scrape.
- Then we would need to collect the profile links of the individual profiles from the search results.
- Go to the individual profile links and save the pages to our file system.
Now that we have figured out the problems, let us dive into the code.
- Logging into LinkedIn
First, we go to the login page, where we find the selectors for the username and password fields. Open the Chrome developer tools with ctrl+shift+i and inspect the username and password fields. We see that each field is assigned an id. Remember, if you are struggling with the code, the link to the repository is given above. Have a look at the picture below:
Create a file named scrape.js and add the following code.
//importing the dependencies
let puppeteer = require("puppeteer");
let cheerio = require("cheerio");
let fs = require("file-system");
var links = [];
var page;
const EMAIL_SELECTOR = "#username";
const PASSWORD_SELECTOR = "#password";
const SUBMIT_SELECTOR = "#app__container > main > div > form > div.login__form_action_container > button";
const LINKEDIN_LOGIN_URL = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin";
const SEARCH_STRING = process.argv[2];
//main
if (process.argv[2] !== undefined) {
(() => {
puppeteer
.launch({ headless: true })
.then(async (browser) => {
page = await browser.newPage();
await page.setViewport({ width: 1366, height: 768 });
await page.goto(LINKEDIN_LOGIN_URL, { waitUntil: "domcontentloaded" });
await page.click(EMAIL_SELECTOR);
//enter your email in the next line
await page.keyboard.type("your@email.com");
await page.click(PASSWORD_SELECTOR);
//enter your password in the next line
await page.keyboard.type("password");
await page.click(SUBMIT_SELECTOR);
})
.catch((err) => {
console.log(" CAUGHT WITH AN ERROR ", err);
});
})();
}

Here, what's going on is that we launch a browser. The code then takes the browser to the LinkedIn login URL and enters the username and password. Do not forget to put your own username and password into the code above. If you change headless to false, you can watch a Chromium browser open and log in to your account automatically.
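One thing to watch out for: clicking the submit button triggers a page navigation, and the script can race ahead before the logged-in page has loaded. A common Puppeteer pattern is to wait for the navigation alongside the click with Promise.all. Below is a sketch of a reusable login helper along these lines; the function name and parameters are my own, not from the repo, and `page` is assumed to be a Puppeteer Page object:

```javascript
// Hypothetical helper: fill in the credentials, then click submit
// while simultaneously waiting for the resulting navigation.
async function login(page, email, password) {
  await page.click("#username");
  await page.keyboard.type(email);
  await page.click("#password");
  await page.keyboard.type(password);
  await Promise.all([
    page.waitForNavigation({ waitUntil: "domcontentloaded" }),
    page.click(
      "#app__container > main > div > form > div.login__form_action_container > button"
    ),
  ]);
}
```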
- Entering the search string
After logging in, the next step is to find a way to enter the search string. After some digging, I found that we can get the list of profiles for a search string simply through a URL. Look at the image below:
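Note that the search string gets interpolated straight into the URL, so if it contains spaces or special characters it is safer to encode it first with encodeURIComponent. A small sketch of the idea; buildSearchUrl is a hypothetical helper, not part of the repo:

```javascript
// Build the people-search URL from a raw search string.
// encodeURIComponent turns spaces and symbols into %xx escapes.
function buildSearchUrl(searchString) {
  return (
    "https://www.linkedin.com/search/results/people/?keywords=" +
    encodeURIComponent(searchString) +
    "&origin=CLUSTER_EXPANSION"
  );
}

console.log(buildSearchUrl("tim cook"));
// https://www.linkedin.com/search/results/people/?keywords=tim%20cook&origin=CLUSTER_EXPANSION
```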
So, to get the list of profiles, add the following code after await page.click(SUBMIT_SELECTOR); from the previous snippet to go to the list page.

await page.goto(
  `https://www.linkedin.com/search/results/people/?keywords=${SEARCH_STRING}&origin=CLUSTER_EXPANSION`,
  { waitUntil: "domcontentloaded" }
);

- Finding the profile links
Now we are almost at the end of the tutorial. Here we just need to find the profile links for all the results we get after entering the search string:
Add the following function to your code :
//helper function to get the profile links from the target page
async function getDataFromPage() {
const content = page.content();
return content
.then((success) => {
const $ = cheerio.load(success);
let divClass = $(".search-results__list").find("li");
//console.log(divClass.attr('class'));
let listLength = divClass.length;
//console.log(listLength);
divClass.each(function (i, element) {
const targetUl = $(element).find(
".search-result__info,.pt3, .pb4, .ph0"
);
//console.log(targetUl.attr('class'));
const name = $(targetUl).find(".name, .actor-name").text();
const link = $(targetUl).find("a").attr("href");
links.push(link);
});
return links;
})
.catch((err) => {
console.log(" CAUGHT WITH AN ERROR ", err);
});
}

What the above code does is select the profile links with the help of the selectors and add them to an array named links.
Then we call this function from the main code by adding the following line:
var data = await getDataFromPage();
- Going to the individual profile links and saving the pages to our local system
After adding the line from the previous step, add the following lines of code.
for (let i = 0; i < data.length; i++) {
  //console.log("Link " + i + " : " + data[i]);
  let newPage = await browser.newPage();
  let newurl = "https://www.linkedin.com" + data[i];
  await newPage.goto(newurl, { waitUntil: "networkidle2" });
  let childHtml = await newPage.content();
  fs.writeFile(`file${i}.txt`, childHtml);
  //console.log(childHtml);
  await newPage.close();
}

The above code loops through the links array, goes to each individual profile page, and saves the page as a .txt file. One thing to remember: it is always good practice to close the headless browser after scraping. So, add the following line after the for loop to close the browser.
browser.close();
With this we are done with our code. To run it, navigate to your project directory in the terminal and type node scrape "your search string".
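For reference, the quoted search string reaches the script through process.argv, which is why scrape.js reads process.argv[2]. A quick sketch of the layout; the paths shown are illustrative and getSearchString is a hypothetical helper:

```javascript
// For `node scrape "tim cook"`, process.argv looks like:
//   [ "/path/to/node", "/path/to/scrape", "tim cook" ]
// so user-supplied arguments start at index 2.
function getSearchString(argv) {
  return argv[2]; // undefined when no search string was passed
}

console.log(getSearchString(["/usr/local/bin/node", "scrape", "tim cook"]));
// tim cook
```

This also explains the `process.argv[2] !== undefined` guard at the top of the main code: it skips the whole run when no search string is supplied.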
And keep in mind that you can always check the github repo for reference.