Apify and Crawlee Official Forum

Updated 12 months ago

Apify not executing Scrapy's close_spider function and going into an infinite loop after it finishes scraping.

Hello, I have a little problem. As I said in the title, my script does not execute the close_spider function, and when the scraping finishes, it goes into an infinite loop. I guess that's why close_spider doesn't get executed. Can anyone help?
Attachment: image.png
Hi,
Please provide a reproduction (code snippet).
It's hard to help without seeing it.
Possibly you have a bug somewhere.
the spider code
Plain Text
async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        
        # Process Actor input
        actor_input = await Actor.get_input() or {}
        max_depth = actor_input.get('max_depth', 1)
        start_urls = [
            'https://gelsf.com/',
            'vowconstruction.com/',
            'https://prosperdevelopment.com/',
            'https://missionhomeremodeling.com/',
            'https://www.leefamilycorp.com/',
            'https://www.a2zremodelingcal.com/',
            'https://lemusco.com/',
            'https://www.agcsf.com/',
            'https://www.goldenheightsremodeling.com/',
        ]
        settings = _get_scrapy_settings(max_depth)
        domain = []
        def get_domain(url):
            try:
                if not urlparse(url).scheme:
                    url = 'http://' + url 
                parsed_url = urlparse(url)
                domain = parsed_url.netloc
                if domain.startswith('www.'):
                    domain = domain[4:]
                return domain
            except Exception:
                print(f'invalid url: {url}')
        for i in start_urls:
            a = get_domain(i)
            domain.append(a)
    

        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider, domain=domain, urls=start_urls)
        process.start()
        print('Finished scraping. Cleaning data...')
the main function code
the problem is that it keeps running indefinitely even after the crawling finishes
Hi, I was not able to reproduce it since the provided code snippets appear to be incomplete. Despite filling in the missing imports and attempting to execute the code, I encountered the following error:

Plain Text
AttributeError: 'Testing' object has no attribute 'is_valid_url'


Could you please provide the complete and functional code of your Actor?

Additionally, providing a link to the run of your Actor would be helpful as well.
Plain Text
def is_valid_url(self, url):
    try:
        parsed = urlparse(url)
        return True
    except Exception as e:
        print(f"Error validating URL: {e}")
        return False
Plain Text
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from urllib.parse import urlparse
from apify import Actor

from ..items import Items
import scrapy
import re
import json
it keeps running indefinitely after it finishes scraping everything, for a reason I don't know
Hey, I did some investigation... If you add an Item Pipeline to your Scrapy-Apify project (based on the Scrapy Actor template), it works, and the close_spider method is correctly called after the spider finishes its work. I even tried your DataCleaningPipeline, and it works; there is no bug in it.
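
For illustration, a minimal item pipeline with the two lifecycle hooks could look like the sketch below (the class name and log messages are hypothetical, not taken from the original project); once it is registered in ITEM_PIPELINES, Scrapy calls its close_spider when the spider finishes:

Plain Text
class SpiderLifecyclePipeline:
    """Pass-through pipeline used only to verify that the lifecycle hooks fire."""

    def open_spider(self, spider):
        spider.logger.info('Pipeline: spider %s opened', spider.name)

    def process_item(self, item, spider):
        # Items are passed through unchanged; this pipeline only observes the lifecycle.
        return item

    def close_spider(self, spider):
        spider.logger.info('Pipeline: spider %s closed', spider.name)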

The problem has to be somewhere in your Spider and/or in main.py. I suggest you keep main.py as simple as possible, e.g. like this:

Plain Text
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from apify import Actor
from .spiders.title import TitleSpider as Spider

def _get_scrapy_settings() -> Settings:
    settings = get_project_settings()
    settings['ITEM_PIPELINES']['apify.scrapy.pipelines.ActorDatasetPushPipeline'] = 1000
    settings['DOWNLOADER_MIDDLEWARES']['scrapy.downloadermiddlewares.retry.RetryMiddleware'] = None
    settings['DOWNLOADER_MIDDLEWARES']['apify.scrapy.middlewares.ApifyRetryMiddleware'] = 1000
    settings['SCHEDULER'] = 'apify.scrapy.scheduler.ApifyScheduler'
    return settings

async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        settings = _get_scrapy_settings()
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(Spider)
        process.start()
And move the start_urls, domain, and other related logic to the Spider (class attributes). And then try to debug your Spider code.

Plain Text
...

class TestSpider(Spider):
    name = 'test'
    second_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+')
    email_pattern = re.compile(r'(?:mailto:)?[A-Za-z0-9._%+-]+@[A-Za-z.-]+\.[A-Za-z]{2,}')
    links_pattern = re.compile(r'(twitter\.com|facebook\.com|instagram\.com|linkedin\.com)/')
    phone_pattern = re.compile(r'tel:\+\d+')

    start_urls = [
        'https://gelsf.com/',
        'https://vowconstruction.com/',
        'https://prosperdevelopment.com/',
        'https://missionhomeremodeling.com/',
        'https://www.leefamilycorp.com/',
        'https://www.a2zremodelingcal.com/',
        'https://lemusco.com/',
        'https://www.agcsf.com/',
        'https://www.goldenheightsremodeling.com/',
    ]

    allowed_domains = [
        'gelsf.com',
        'vowconstruction.com',
        'prosperdevelopment.com',
        'missionhomeremodeling.com',
        'www.leefamilycorp.com',
        'www.a2zremodelingcal.com',
        'lemusco.com',
        'www.agcsf.com',
        'www.goldenheightsremodeling.com',
    ]

    headers = {
        ...
    }

    ...
The problem is that I don't know what the allowed domains will be, nor the start URLs. The allowed domains come from user input, so I don't know what the user will provide.
I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items, and even with that it was still running indefinitely. Should I perhaps put everything in one file instead of a project?
I removed apify.scrapy.pipelines.ActorDatasetPushPipeline because it was pushing the non-cleaned items.

I believe for this use case you don't have to remove ActorDatasetPushPipeline; you should rather implement your own cleaning pipeline, which will be executed before the push pipeline (which should be the last one). Example of such a CleaningPipeline:

Plain Text
class CleaningPipeline:
    def process_item(self, item: BookItem, spider: Spider) -> BookItem:
        number_map = {
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
        }
        return BookItem(
            title=item['title'],
            price=float(item['price'].replace('Β£', '')),
            rating=number_map[item['rating'].split(' ')[1].lower()],
            in_stock=bool(item['in_stock'].lower() == 'in stock'),
        )
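
The execution order is controlled by the integer priorities in ITEM_PIPELINES (lower numbers run first), so a settings sketch where the cleaning pipeline runs before the push pipeline could look like this (the myproject.pipelines module path is hypothetical):

Plain Text
ITEM_PIPELINES = {
    # Clean/normalize the scraped items first...
    'myproject.pipelines.CleaningPipeline': 500,
    # ...then push the already-cleaned items to the Apify dataset.
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
}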
There's another problem: I have to do it the way I do it. I don't know what the allowed domains or the start URLs will be; they come from user input, so I don't know what the user will provide.

I got it. But for the purpose of debugging, you can select one possible input and hard-code it into the Spider.
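
As a complementary sketch (the spider name and input field are hypothetical and simplified from the thread), main.py can forward the Actor input to process.crawl as keyword arguments, and the spider can derive allowed_domains from the user-supplied start URLs in its constructor, so neither list has to be known in advance:

Plain Text
from urllib.parse import urlparse

import scrapy


class InputDrivenSpider(scrapy.Spider):
    name = 'input_driven'

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Normalize the user-supplied URLs (add a scheme if it is missing)...
        self.start_urls = [
            url if urlparse(url).scheme else f'https://{url}'
            for url in (start_urls or [])
        ]
        # ...and derive allowed_domains from them, dropping any 'www.' prefix.
        self.allowed_domains = [
            urlparse(url).netloc.removeprefix('www.') for url in self.start_urls
        ]

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

In main.py this would be wired up with something like process.crawl(InputDrivenSpider, start_urls=actor_input.get('start_urls', [])), so a hard-coded list is only needed while debugging.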
I'll do that. Ty
Yeah, I believe the problem has to be in the Spider, because if I try it with another one, it works.
On a local run without Apify, it works very well.