Apify and Crawlee Official Forum

How to send a POST request (I'm doing reverse engineering)

I'm reverse engineering a site and have discovered an endpoint that returns all the data I need via a POST request. I've copied the request as cURL to analyze the parameters required to make the request correctly.

I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework.

How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared?

Additionally, if anyone has experience, I'd appreciate any insights on handling POST requests within this framework.

Thanks
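
(The first reply isn't shown here, but judging from the follow-up it suggested setting the method when constructing the Request. A minimal sketch of that approach, assuming the Crawlee version in use accepts method, headers, and payload as keyword arguments to Request.from_url; the endpoint URL and body below are placeholders:)

Plain Text
from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    # Build the request with an explicit HTTP method and (optionally) a body.
    request = Request.from_url(
        url="https://example.com/api",  # placeholder endpoint
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        payload=b"pageIndex=1&sortBy=0",  # placeholder body
    )

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f"Processing {context.request.url}")

    await crawler.run([request])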
Hi, the above answer doesn't work for me. I found this open issue, and maybe it's related, because I'm trying to do a POST request and I'm not getting any data.

https://github.com/apify/crawlee-python/issues/560

Here's how I'm adding the request:

Plain Text
from apify import Actor
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        # Run the crawler
        await crawler.run([initial_req])


Here's the response when I try to save the JSON:

Plain Text
{
  "url": "MY_URL",
  "unique_key": "MY_URL",
  "method": "POST",
  "headers": {},
  "query_params": {},
  "payload": null,
  "data": {},
  "user_data": {
    "__crawlee": {
      "state": 3
    }
  },
  "retry_count": 0,
  "no_retry": false,
  "loaded_url": "MY_URL",
  "handled_at": null,
  "id": "iEYRVLtHdfdR7s6",
  "json_": null,
  "order_no": null
}
The issue also seems related to this PR: https://github.com/apify/crawlee-python/pull/542

I'm adding this URL here so I can follow the issue. I'm interested in helping, because I use Crawlee and Apify a lot.
Hey, @frankman

Yes, I created issue 560 πŸ™‚

About your URL: I don't see any payload in it. That is, you're passing all the parameters as query-string parameters, not in the body of the POST request.

Are you sure you are creating it correctly?

Are you doing the same thing using HTTPX?

If you look at how the site sees the request using httpbin.org/post, you'll get this response format:

Plain Text
{
  "args": {
    "categoryId": "4555genreId=undefined", 
    "eventCountryType": "0", 
    "eventViewType": "0", 
    "fromPrice": "undefined", 
    "gridFilterType": "0", 
    "homeAwayFilterType": "0", 
    "method": "GetFilteredEvents", 
    "nearbyGridRadius": "50", 
    "opponentCategoryId": "0", 
    "pageIndex": "1", 
    "sortBy": "0", 
    "toPrice": "undefined", 
    "venueIdFilterType": "0"
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "User-Agent": "python-httpx/0.27.2", 
    "X-Amzn-Trace-Id": "Root=1-67100e24-37616e605f9cf31e5538556b"
  }, 
  "json": null, 
  "origin": "91.240.96.149", 
  "url": "https://httpbin.org/post?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId%3Dundefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
}


This is exactly what should happen for your example: all the parameters end up in args, and the request body is empty.
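
For reference, that httpbin check is easy to reproduce with plain httpx (httpbin.org just echoes back whatever it receives):

Plain Text
import httpx

# POST with everything in the query string: httpbin reports all parameters
# under "args" and an empty body ("data" / "form").
resp = httpx.post(
    "https://httpbin.org/post"
    "?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50"
    "&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1"
    "&method=GetFilteredEvents&categoryId=4555genreId=undefined"
    "&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
)
print(resp.json())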

You'll also see an error in your URL 🙂

You forgot the & before the genreId parameter.

The correct URL should be:

Plain Text
url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
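
A quick way to see what the missing & does is to parse both query strings with the standard library (this snippet is just an illustration, with a placeholder domain):

Plain Text
from urllib.parse import parse_qs, urlparse

broken = "https://example.com/?categoryId=4555genreId=undefined&eventCountryType=0"
fixed = "https://example.com/?categoryId=4555&genreId=undefined&eventCountryType=0"

# Without the separator, "4555genreId=undefined" becomes the value of categoryId.
print(parse_qs(urlparse(broken).query)["categoryId"])  # ['4555genreId=undefined']
print(parse_qs(urlparse(fixed).query)["categoryId"])   # ['4555']
print(parse_qs(urlparse(fixed).query)["genreId"])      # ['undefined']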
Sorry, I had deleted the domain name and some parameters, so there was a mistake in the URL you analyzed. I'll put up the original link so you can check it again.

Plain Text
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        await crawler.run([initial_req])


Continues 🧡
The output was:

Plain Text
{
  "url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
  "unique_key": "https://www.viagogo.com/concert-tickets/pop-rock/dance-pop/shakira-tickets?categoryid=4555&eventcountrytype=0&eventviewtype=0&from=1970-01-01t00%3a00%3a00.000z&fromprice=undefined&genreid=undefined&gridfiltertype=0&homeawayfiltertype=0&lat=39.044&lon=-77.488&method=getfilteredevents&nearbygridradius=50&opponentcategoryid=0&pageindex=1&radiusfrom=80467&radiusto=null&sortby=0&to=9999-12-30t23%3a00%3a00.000z&toprice=undefined&venueidfiltertype=0",
  "method": "POST",
  "headers": {},
  "query_params": {},
  "payload": null,
  "data": {},
  "user_data": {
    "__crawlee": {
      "state": 3
    }
  },
  "retry_count": 0,
  "no_retry": false,
  "loaded_url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
  "handled_at": null,
  "id": "iEYRVLtHdfdR7s6",
  "json_": null,
  "order_no": null
}


If I do the same but only with httpx:
Plain Text
import httpx

resp = httpx.post(url)
print(resp.json())

> output: 

{'items': [{'eventId': 153433356,
   'name': 'Shakira',
   'url': 'https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets/E-153433356',
   'dayOfWeek': 'Wed',
...
}
Hi. All the code works correctly.

The problem is in exactly what you are doing:

  1. context.request.model_dump_json() outputs the Request metadata, as you can see, and the metadata does not include the server response.
As a result, you are comparing request metadata from Crawlee with the server response from httpx...

  2. I don't really understand why you need BeautifulSoupCrawler for working with JSON. I think it would be more appropriate to use ParselCrawler or HttpCrawler with a convenient library for working with JSON.
Here is sample code that does what you expect:

Plain Text
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")

            await context.push_data(context.soup.find("p").text)

        await crawler.run([initial_req])
You're right, Mantisus. Now I'm using HttpCrawler() and I'm getting the data I want:

This code does what I want:

Plain Text
from apify import Actor
from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
import json


async def main() -> None:
    async with Actor:
        crawler = HttpCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: HttpCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            json_response = context.http_response.read()  # raw response body; together with json.loads() below, this is equivalent to resp.json() after resp = httpx.post(url)
            json_resp_parsed = json.loads(json_response)
            await context.push_data(json_resp_parsed)

        await crawler.run([initial_req])


Thanks Mantisus
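
For completeness: this particular endpoint takes everything in the query string, but if a site expects the parameters in the POST body instead, the Request can carry them as a payload. A sketch under the same assumption as above (that Request.from_url accepts headers and payload; the endpoint and form fields are placeholders):

Plain Text
from urllib.parse import urlencode

from crawlee import Request

# Placeholder form fields; a real endpoint would define its own.
form = {"method": "GetFilteredEvents", "pageIndex": "1"}

body_req = Request.from_url(
    url="https://example.com/api",
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    payload=urlencode(form).encode(),  # sent as the request body (assumed)
)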