
Scraping GraphQL with Python and Scrapy


I’ve recently been working on a hobby project that involves scraping data from various websites. Most of these websites are old and don’t have APIs, so I have to scrape the data directly from the web pages. While some sites are easy to scrape, others are more challenging. Although I’m familiar with Scrapy, I typically use requests and BeautifulSoup due to their simplicity and Scrapy’s steep learning curve.

Some websites include pagination, requiring me to scrape multiple pages. For these sites I’ve been using Selenium with a headless browser like Chrome or Firefox (Gecko). This approach works until I encounter infinite-scroll pagination, which requires scrolling to the bottom of the page to load more data; Selenium can handle that, but it is notably slow.

Scraping 500px Photos

I have an old 500px account with several photos that I want to download before closing the account. Unfortunately, 500px doesn’t offer a straightforward way to download all photos at once, nor do they provide free API access, so I had to resort to scraping. Fortunately, I discovered that 500px uses GraphQL for their API; I’m not an expert in GraphQL, but I’m familiar with it.

Scraping Photos from 500px with Python Requests

Here’s how I initially scraped my photos from 500px using the requests library in Python, step by step:

  1. Go to the profile page of the user whose photos you want to scrape. The URL typically looks like this: https://500px.com/p/<username>.
  2. Open the developer tools (Inspect Element), go to the Network tab, and select Fetch/XHR.
  3. Clear the network tab, scroll down to load more photos, and observe the requests. Look for a request to the GraphQL endpoint.
  4. Right-click the GraphQL request and select “Copy as cURL (bash).”
  5. Use a tool like curlconverter.com to convert the cURL request to Python requests code.
  6. Execute the Python script to see the response in JSON format. You can then save this data to a file or parse it and save it to a database.

Here’s an example of the Python requests code for scraping photos from 500px:

import requests
import json

url = "https://api.500px.com/graphql"

headers = {
  "Content-Type": "application/json",
  # Add other necessary headers here (step 5)
}

json_data = {
  # GraphQL query here (step 5)
}

response = requests.post(url, headers=headers, json=json_data)
data = response.json()

for node in data['data']['profile']['user']['photos']['nodes']:
    photo_url = node['images']['url']
    # Download the photo 
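
The final “Download the photo” step can be as simple as fetching each URL and writing the bytes to disk. Here’s a minimal sketch; the filename logic is only an illustration and may need adjusting to how 500px constructs its image URLs:

import os
import requests

def download_photo(photo_url, out_dir='photos'):
  # Fetch a single image URL and write the bytes to disk
  os.makedirs(out_dir, exist_ok=True)
  # Derive a filename from the last path segment, dropping any query string
  filename = photo_url.split('/')[-1].split('?')[0] or 'photo.jpg'
  resp = requests.get(photo_url, timeout=30)
  resp.raise_for_status()
  with open(os.path.join(out_dir, filename), 'wb') as f:
    f.write(resp.content)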

Handling Pagination with Requests

The initial script worked fine but didn’t return all my photos at once. To scrape multiple pages, I had to update the cursor value to get the next set of photos.
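
In practice that means re-sending the same request in a loop, feeding pageInfo.endCursor back into the query’s cursor variable until hasNextPage is false. A rough sketch, assuming the query you copied in step 5 takes a cursor variable and returns pageInfo (adjust the field names to your actual query):

cursor = None
while True:
  # Re-send the same GraphQL query with the cursor from the previous page
  json_data['variables']['cursor'] = cursor
  response = requests.post(url, headers=headers, json=json_data)
  data = response.json()

  photos = data['data']['profile']['user']['photos']
  for node in photos['nodes']:
    photo_url = node['images']['url']
    # Download the photo as before

  page_info = photos['pageInfo']
  if not page_info['hasNextPage']:
    break
  cursor = page_info['endCursor']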


Scraping Photos from 500px with Scrapy

Now, let’s see how to achieve the same result using Scrapy, which is more efficient for complex scraping tasks.

This method works only for public profiles. If the profile is private, you will need to log in to 500px, retrieve the necessary cookies and headers from your browser, and include them in your requests.
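
If you do need to authenticate, the cookies and headers copied from your browser can be passed straight into the Scrapy request built later in the spider. A sketch with placeholder names (use whatever your browser actually sends):

# Placeholder names/values: copy the real ones from the GraphQL request
# in your browser's Network tab
headers = {
  'Content-Type': 'application/json',
  '<auth-header-name>': '<auth-header-value>',
}
cookies = {
  '<session-cookie-name>': '<session-cookie-value>',
}

yield scrapy.Request(url, method='POST', headers=headers, cookies=cookies,
                     body=json.dumps(graphql_query), callback=self.parse_photos, meta=meta)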
  1. First, install Scrapy using pip:

pip install scrapy

  2. Create a new Scrapy project:

scrapy startproject FiveHundredPx

  3. Navigate to the project directory and generate a new spider:

cd FiveHundredPx
scrapy genspider a500px 500px.com

Now, let’s delve into the spider code.

Spider Code Breakdown

1. Importing and Defining the Spider Class

import time
import scrapy
import json
import urllib.request

class A500pxSpider(scrapy.Spider):
  name = "500px"
  allowed_domains = ['500px.com', 'api.500px.com']
  start_urls = ['https://api.500px.com/graphql']
  username = "<username>"
  • name: A unique name for the spider.
  • allowed_domains: Domains the spider is allowed to scrape.
  • start_urls: Initial URL to start scraping.
  • username: The 500px username whose photos we want to scrape.

2. Initiating Requests

def start_requests(self):
  graphql_query = {
    "operationName": "OtherPhotosPaginationContainerQuery",
    "variables": {
      "username": self.username,
      "pageSize": 20,
      "cursor": None,
      "excludeNude": False
    },
    "query": """query OtherPhotosPaginationContainerQuery($username: String!, $pageSize: Int, $cursor: String, $excludeNude: Boolean) {
          userByUsername(username: $username) {
            ...OtherPhotosPaginationContainer_user_2amW3i
            id
          }
        }

        fragment OtherPhotosPaginationContainer_user_2amW3i on User {
          photos(first: $pageSize, after: $cursor, privacy: PROFILE, sort: ID_DESC, excludeNude: $excludeNude) {
            edges {
              node {
                id
                legacyId
                canonicalPath
                width
                height
                name
                isLikedByMe
                notSafeForWork
                photographer: uploader {
                  id
                  legacyId
                  username
                  displayName
                  canonicalPath
                }
                images(sizes: [35, 33]) {
                  size
                  url
                  jpegUrl
                  webpUrl
                  id
                }
                __typename
              }
              cursor
            }
            totalCount
            pageInfo {
                endCursor
                hasNextPage
            }
          }
        }"""
  }
  url = 'https://api.500px.com/graphql'
  headers = {
    'Content-Type': 'application/json',
    # You might need to include headers for authorization if required
  }
  meta = {'username': self.username}
  yield scrapy.Request(url, method='POST', headers=headers, body=json.dumps(graphql_query), callback=self.parse_photos, meta=meta)
  • Constructs the GraphQL query and sends a POST request to the GraphQL endpoint.
  • Sets headers and metadata.
  • Uses yield to schedule the request, specifying parse_photos as the callback method.

3. Parsing Photo Data

def parse_photos(self, response):
  data = json.loads(response.body)
  username = response.meta['username']
  photos = data['data']['userByUsername']['photos']['edges']
  for photo in photos:
    node = photo['node']
    self.save_image(node)
    yield {
      'username': username,
      'id': node['id'],
      'title': node['name'],
      'url': node['canonicalPath'],
      'image_urls': [image['url'] for image in node['images']],
      'photographer': {
        'username': node['photographer']['username'],
        'displayName': node['photographer']['displayName'],
        'url': node['photographer']['canonicalPath']
      }
    }
  • Parses the JSON response and extracts relevant data from each photo node.
  • Calls save_image to download the photo.
  • Yields the photo metadata, which Scrapy handles for storage or further processing.

4. Handling Pagination

  page_info = data['data']['userByUsername']['photos']['pageInfo']
  if page_info['hasNextPage']:
    next_cursor = page_info['endCursor']
    # Same query dict as in start_requests, with variables['cursor'] set to next_cursor
    graphql_query = "<Same as before>"
    url = 'https://api.500px.com/graphql'
    headers = {
      'Content-Type': 'application/json',
      # You might need to include headers for authorization if required
    }
    meta = {'username': username}
    yield scrapy.Request(url, method='POST', headers=headers, body=json.dumps(graphql_query), callback=self.parse_photos, meta=meta)
  • Checks if there are more pages.
  • If so, updates the cursor for the next request.
  • Schedules the next request with the updated cursor value to continue scraping.

5. Saving Images

def save_image(self, node):
  image_urls = [image['url'] for image in node['images']]
  first_image_url = image_urls[0]
  filename = first_image_url.split('=')[-1][:6]
  time.sleep(1)
  urllib.request.urlretrieve(first_image_url, filename)
  • Extracts image URLs from the node.
  • Generates a filename from the URL.
  • Uses urllib.request.urlretrieve to download the image.
  • Adds a delay to avoid being blocked by the server.
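
Because each yielded item already carries an image_urls field, an alternative worth considering is Scrapy’s built-in ImagesPipeline (it requires Pillow), which downloads images through Scrapy’s scheduler instead of blocking the spider with urllib and time.sleep. A sketch of the relevant settings.py entries:

# settings.py: let Scrapy's ImagesPipeline download the 'image_urls' field
ITEM_PIPELINES = {
  'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'photos'   # directory where downloaded images are stored
DOWNLOAD_DELAY = 1        # politeness delay instead of time.sleep() in the spider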

Running the Spider

To run the spider, use the following command in your terminal:

scrapy crawl 500px

This command will execute the spider, starting from the initial request, and will follow the flow defined in the code to scrape all photos from the specified 500px user profile.
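
If you also want the yielded metadata written to a file, Scrapy can do that directly from the command line:

scrapy crawl 500px -O photos.json

The -O flag (Scrapy 2.0+) overwrites the output file on each run; older versions only support -o, which appends.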


Cover image generated by Bing
