Are you tired of scraping websites and ending up with a plethora of duplicate URLs? Do you want to extract all the URLs from a webpage without any duplicates? Look no further! In this article, we’ll show you how to use Python Scrapy to get all URLs in a webpage without duplicates.

What is Scrapy?

Scrapy is a Python framework used for building web scrapers. It provides a flexible and efficient way to extract data from websites. With Scrapy, you can write spiders that navigate websites, extract data, and store it in a structured format.

Why Do We Need to Remove Duplicates?

Duplicate URLs can be a nuisance when scraping websites. They can lead to:

  • Wasted resources: Scraping the same URL multiple times can waste resources and slow down your scraper.
  • Inaccurate data: Duplicate URLs can lead to inaccurate data, especially if you’re counting the number of URLs or unique pages.
  • Increased storage: Storing duplicate URLs can take up unnecessary storage space.

How to Get All URLs in a Webpage Without Duplicates Using Scrapy

To get all URLs in a webpage without duplicates using Scrapy, you’ll need to:

  1. Install Scrapy: If you haven’t already, install Scrapy using pip: pip install scrapy
  2. Create a new Scrapy project: Run scrapy startproject projectname to create a new Scrapy project.
  3. Define your spider: In your project’s spiders directory, create a new Python file and define your spider:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        '',  # replace with your starting URL
    ]

    def parse(self, response):
        # extract all URLs from the webpage
        urls = response.css('a::attr(href)').getall()

        # remove duplicates using a set
        unique_urls = set(urls)

        # yield the unique URLs
        for url in unique_urls:
            yield {'url': url}

How the Code Works

The code above defines a Scrapy spider that starts at the URL listed in start_urls. The parse method is called on each downloaded page, and it:

  • Extracts all URLs from the webpage using the CSS selector a::attr(href).
  • Removes duplicates by converting the list of URLs to a set.
  • Yields each unique URL as a dictionary.
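
One caveat with the set approach: a set does not keep the links in the order they appear on the page. If order matters, a minimal sketch like the following removes duplicates while keeping the first occurrence of each link, using dict.fromkeys in place of set. The function name and the sample links are made up for illustration.

```python
def dedupe_keep_order(urls):
    """Return urls without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Hypothetical hrefs, as response.css('a::attr(href)').getall() might return them
links = ["/about", "/contact", "/about", "/blog", "/contact"]
print(dedupe_keep_order(links))  # ['/about', '/contact', '/blog']
```

Inside parse, you could call this helper on urls where the original code calls set(urls).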

Running the Spider

To run the spider, navigate to your project’s directory and run:

scrapy crawl myspider

This will start the spider and begin scraping the website. You can view the output in the terminal or save it to a file using:

scrapy crawl myspider -o output.csv
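
As a quick sanity check on the export, the sketch below reads a CSV feed with a url column (the header written for the {'url': ...} items above) and reports how many rows and how many distinct URLs it contains. The helper name count_urls and the inline sample data are assumptions for illustration; in practice you would pass it open('output.csv', newline='').

```python
import csv
import io


def count_urls(csv_file):
    """Return (total_rows, distinct_urls) for a CSV feed with a 'url' column."""
    rows = [row["url"] for row in csv.DictReader(csv_file)]
    return len(rows), len(set(rows))


# Inline sample standing in for output.csv (made-up data)
sample = io.StringIO("url\n/about\n/contact\n/about\n")
total, distinct = count_urls(sample)
print(f"{total} rows, {distinct} distinct URLs")  # 3 rows, 2 distinct URLs
```

If the two numbers differ, duplicates made it into the file. One common cause is that -o appends to an existing file, so running the spider twice writes every URL twice.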

Additional Tips and Variations

Handling Relative URLs

If you encounter relative URLs (e.g., /about instead of a full URL with scheme and domain), you can use the response.urljoin method to convert them to absolute URLs:

urls = [response.urljoin(url) for url in response.css('a::attr(href)').getall()]
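
To see why it helps to resolve links before deduplicating, here is a small standalone sketch using urllib.parse.urljoin from the standard library, which behaves much like response.urljoin for this purpose. The page URL and the hrefs are made-up examples.

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog/post"  # hypothetical page being scraped
hrefs = ["/about", "https://example.com/about", "contact", "/about"]

# Resolve every href against the page URL, then dedupe with a set
absolute = {urljoin(page_url, href) for href in hrefs}
print(sorted(absolute))
# ['https://example.com/about', 'https://example.com/blog/contact']
```

Without the join, /about and https://example.com/about would count as two different URLs even though they point to the same page.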

Excluding Certain URLs

If you want to exclude certain URLs (e.g., URLs containing a # fragment or a ? query string), you can use a set comprehension:

unique_urls = {url for url in unique_urls if '#' not in url and '?' not in url}
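
Dropping every URL that contains # also discards links such as /pricing#plans, whose target page may still be worth keeping. An alternative sketch, which is a variation on the filter above and not a replacement for it, strips the fragment with urllib.parse.urldefrag and keeps only http and https links. It assumes the URLs are already absolute; the function name and sample URLs are hypothetical.

```python
from urllib.parse import urldefrag, urlparse


def clean_urls(urls):
    """Strip #fragments and keep only http(s) URLs, without duplicates."""
    cleaned = set()
    for url in urls:
        url_without_fragment = urldefrag(url).url
        if urlparse(url_without_fragment).scheme in ("http", "https"):
            cleaned.add(url_without_fragment)
    return cleaned


sample = [
    "https://example.com/pricing#plans",
    "https://example.com/pricing",
    "mailto:hello@example.com",
    "javascript:void(0)",
]
print(clean_urls(sample))  # {'https://example.com/pricing'}
```

Here the two pricing links collapse into one entry, and the mailto: and javascript: links are dropped because they are not pages a spider can fetch.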

Following Links

If you want to follow links and scrape multiple pages, you can modify the spider to yield a new request for each URL:

def parse(self, response):
    # ...
    for url in unique_urls:
        yield response.follow(url, self.parse)
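
Once the spider follows links, the per-page set no longer guarantees unique output, because the same URL can appear on many pages. Scrapy's scheduler filters duplicate requests by default, but it does not filter the items you yield. The idea of tracking URLs across pages can be sketched without Scrapy as shown here; the in-memory site dictionary stands in for real pages and is entirely made up.

```python
from collections import deque

# Made-up link graph: page URL -> links found on that page
site = {
    "/": ["/about", "/blog", "/about"],
    "/about": ["/", "/contact"],
    "/blog": ["/", "/about", "/blog/post-1"],
    "/contact": [],
    "/blog/post-1": ["/blog"],
}


def crawl(start):
    """Visit pages breadth-first, returning each URL exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in site.get(page, []):
            if link not in seen:  # skip URLs already discovered on any page
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/"))
# ['/', '/about', '/blog', '/contact', '/blog/post-1']
```

In a spider, one way to apply the same idea would be to keep a seen set on the spider instance and check it in parse before yielding each URL.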


In this article, we’ve shown you how to use Python Scrapy to get all URLs in a webpage without duplicates. By using a set to remove duplicates and modifying the code to handle relative URLs, exclude certain URLs, and follow links, you can create a robust web scraper that extracts the data you need. Happy scraping!

Next time you’re scraping websites, remember to remove those pesky duplicates and make your scraping life easier!

