Apify and Crawlee Official Forum

Apify not executing Scrapy's close_spider function and running indefinitely after it finishes scraping.

Hello, I've got a little problem. As I said in the title, my script does not execute the close_spider function, and when the scraping finishes, it goes into an infinite loop. I guess that's why close_spider doesn't get executed. Can anyone help?
Attachment: image.png
Hi,
Please provide a reproduction (code snippet).
It's hard to help without seeing it.
Possibly you have a bug somewhere.
the spider code
Plain Text
async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')

        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = ['https://gelsf.com/', 'vowconstruction.com/', 'https://prosperdevelopment.com/', 'https://missionhomeremodeling.com/', 'https://www.leefamilycorp.com/', 'https://www.a2zremodelingcal.com/', 'https://lemusco.com/', 'https://www.agcsf.com/', 'https://www.goldenheightsremodeling.com/']
        settings = _get_scrapy_settings(max_depth)

        def get_domain(url):
            # Normalize a URL to its bare domain (no scheme, no leading 'www.').
            try:
                if not urlparse(url).scheme:
                    url = 'http://' + url
                netloc = urlparse(url).netloc
                return netloc[4:] if netloc.startswith('www.') else netloc
            except Exception:
                print(f'invalid url: {url}')

        domain = [get_domain(url) for url in start_urls]

        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider, domain=domain, urls=start_urls)
        process.start()
        print('Finished scraping. Cleaning data...')
the main function code
The problem is that it keeps running indefinitely even after it finishes crawling everything.
Hi, I was not able to reproduce it since the provided code snippets appear to be incomplete. Despite filling in the missing imports and attempting to execute the code, I encountered the following error:

Plain Text
AttributeError: 'Testing' object has no attribute 'is_valid_url'


Could you please provide the complete and functional code of your Actor?

Additionally, providing a link to the run of your Actor would be helpful as well.
Plain Text
def is_valid_url(self, url):
    try:
        parsed = urlparse(url)
        return True
    except Exception as e:
        print(f"Error validating URL: {e}")
        return False
Plain Text
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from urllib.parse import urlparse
from apify import Actor

from ..items import Items
import scrapy
import re
import json
It keeps running indefinitely after it finishes scraping everything, for a reason I don't know.
Hey, I did some investigation... If you add an Item Pipeline to your Scrapy-Apify project (based on the Scrapy Actor template), it works, and the close_spider method is correctly called after the spider finishes its work. I even tried using your DataCleaningPipeline, and it works; there is no bug in it.
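For reference, close_spider is one of the standard Scrapy item-pipeline hooks (alongside open_spider and process_item). Below is a minimal sketch of a pipeline skeleton with that hook; the class name mirrors the DataCleaningPipeline mentioned above, but the body here is only a placeholder, not the original implementation.

Plain Text
class DataCleaningPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.items = []

    def process_item(self, item, spider):
        # Called for every scraped item; clean it here and pass it on.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Called once after the spider finishes -- this is the hook that
        # never fires when the run hangs.
        spider.logger.info(f'Finished scraping, collected {len(self.items)} items.')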

The problem has to be somewhere in your Spider and/or in main.py. I suggest you keep main.py as simple as possible, e.g. like this:

Plain Text
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from apify import Actor
from .spiders.title import TitleSpider as Spider

def _get_scrapy_settings() -> Settings:
    settings = get_project_settings()
    settings['ITEM_PIPELINES']['apify.scrapy.pipelines.ActorDatasetPushPipeline'] = 1000
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    settings['DOWNLOADER_MIDDLEWARES']['apify.scrapy.middlewares.ApifyRetryMiddleware'] = 1000
    settings['SCHEDULER'] = 'apify.scrapy.scheduler.ApifyScheduler'
    return settings

async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        settings = _get_scrapy_settings()
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
Then move the start_urls, domain, and other related logic into the Spider (as class attributes), and try to debug your Spider code from there.

Plain Text
...

class TestSpider(Spider):
    name = 'test'
    second_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+')
    email_pattern = re.compile(r'(?:mailto:)?[A-Za-z0-9._%+-]+@[A-Za-z.-]+\.[A-Za-z]{2,}')
    links_pattern = re.compile(r'(twitter\.com|facebook\.com|instagram\.com|linkedin\.com)/')
    phone_pattern = re.compile(r'tel:\+\d+')

    start_urls = [
        'https://gelsf.com/',
        'https://vowconstruction.com/',
        'https://prosperdevelopment.com/',
        'https://missionhomeremodeling.com/',
        'https://www.leefamilycorp.com/',
        'https://www.a2zremodelingcal.com/',
        'https://lemusco.com/',
        'https://www.agcsf.com/',
        'https://www.goldenheightsremodeling.com/',
    ]

    allowed_domains = [
        'gelsf.com',
        'vowconstruction.com',
        'prosperdevelopment.com',
        'missionhomeremodeling.com',
        'www.leefamilycorp.com',
        'www.a2zremodelingcal.com',
        'lemusco.com',
        'www.agcsf.com',
        'www.goldenheightsremodeling.com',
    ]

    headers = {
        ...
    }

    ...
The problem is that I don't know what the allowed domains will be, nor the start URLs. They come from user input, so I don't know in advance what the user will provide.
I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items, and even with that it was still running indefinitely.
Perhaps I should put everything in one file instead of a project?
I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items.

I believe for this use case you don't have to remove ActorDatasetPushPipeline; you should rather implement your own cleaning pipeline, which will be executed before the push pipeline (which should be the last one). Example of such a CleaningPipeline:

Plain Text
class CleaningPipeline:
    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        number_map = {
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
        }
        return BookItem(
            title=item['title'],
            price=float(item['price'].replace('Β£', '')),
            rating=number_map[item['rating'].split(' ')[1].lower()],
            in_stock=bool(item['in_stock'].lower() == 'in stock'),
        )
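For context, the order in which pipelines run is controlled by the integer priority in the ITEM_PIPELINES setting: lower numbers run first. A minimal sketch of wiring a cleaning pipeline in front of the push pipeline, e.g. in settings.py (the module path myproject.pipelines is an assumption; substitute your project's package):

Plain Text
# Lower priority values run earlier, so items are cleaned before they are
# pushed to the dataset. 'myproject.pipelines' is a placeholder module path.
ITEM_PIPELINES = {
    'myproject.pipelines.DataCleaningPipeline': 500,
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
}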
There's another problem: I have to do it the way I'm doing it. I don't know what the allowed domains will be, nor the start URLs; they come from user input, so I can't know them in advance.

I got it. But for the purpose of debugging, you can pick one possible input and hard-code it into the Spider.
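Once the Spider works with a hard-coded input, the dynamic values can still be handed to it at runtime through the crawler, the same way the original main() passes domain and urls to process.crawl. A minimal sketch, assuming the Actor input contains a start_urls list (the field name and spider name are assumptions):

Plain Text
from urllib.parse import urlparse
from scrapy import Spider


class UserInputSpider(Spider):
    name = 'user_input'

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Whatever the user supplied in the Actor input becomes the start URLs.
        self.start_urls = start_urls or []
        # Derive allowed_domains from those URLs instead of hard-coding them.
        self.allowed_domains = [
            urlparse(url).netloc.removeprefix('www.') for url in self.start_urls
        ]


# In main(), after reading the Actor input:
#   process.crawl(UserInputSpider, start_urls=actor_input.get('start_urls', []))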
I'll do that. Ty
Yeah, I believe the problem has to be in the Spider, because if I try it with another one, it works.
On a local run without Apify, it works very well.