I have not systematically studied web scraping before, so I am making notes while watching this video.
Request / Response
A request is the information a browser or program sends to the server to ask for content; the response is the server's reply to that request.
Request content
- Request method: GET / POST (most commonly used)
  - POST carries form data that GET does not
  - GET parameters are included directly in the URL, while POST parameters are sent in the form
- Request URL: A Uniform Resource Locator (URL) that points to the file or resource being requested
- Request headers: Important configuration information stored as key-value pairs
- Request body: Generally empty for GET, but required for POST
Response content
- Response status code: A numeric code indicating how the request was handled (e.g., 200 for success, 404 for not found)
- Response headers
- Response body: The result of the request
Python requests module usage
requests is built on top of urllib3 and provides a more convenient, feature-rich interface.
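As a starting point, a minimal GET request looks like this (httpbin.org serves here as a stand-in test endpoint):

```python
import requests

# fetch a page and inspect the reply
response = requests.get('https://httpbin.org/get')
print(response.status_code)  # e.g. 200
print(response.text)         # the response body as text
```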
GET Request Related
Adding request parameters
- The `params` argument conveniently adds query parameters to the request, avoiding manual URL construction
```python
import requests

# placeholder query parameters
data = {
    'name': 'test',
    'age': 22
}
response = requests.get('https://httpbin.org/get', params=data)
print(response.url)  # the parameters are appended to the URL automatically
```
JSON parsing
- Provides a `json()` method that directly converts the returned JSON string into a JSON object
```python
import requests

data = {'name': 'test'}  # placeholder parameters
response = requests.get('https://httpbin.org/get', params=data)
# parse the JSON body into a Python dict in one call
print(response.json())
```
Binary data retrieval
- Use GET to request the image directly, then write the bytes to a file opened in 'wb' mode
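A minimal sketch, using an httpbin test image as a placeholder URL:

```python
import requests

# request the image and save its raw bytes to disk
response = requests.get('https://httpbin.org/image/png')
with open('image.png', 'wb') as f:
    f.write(response.content)  # .content is the binary response body
```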
Adding headers
- Mainly needed for the request to succeed; some websites check the User-Agent header to block automated scraping
```python
import requests

# a browser-like User-Agent string (example value)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0'
}
response = requests.get('https://httpbin.org/get', headers=headers)
```
POST Request Related
Adding request parameters / headers
- Similar to GET (see above), except form parameters are passed via the `data` argument; see the sketch below
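A minimal sketch, assuming a placeholder test URL; unlike GET, the form parameters travel in the request body:

```python
import requests

form = {'name': 'test', 'age': 22}  # placeholder form data
headers = {'User-Agent': 'Mozilla/5.0'}  # example header

# parameters are sent in the body as form data, not in the URL
response = requests.post('https://httpbin.org/post', data=form, headers=headers)
print(response.status_code)
```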
Response attributes
- Common attributes include (see the sketch after this list):
- status_code
- headers
- cookies
- url
- history
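A minimal sketch printing each attribute (placeholder URL):

```python
import requests

response = requests.get('https://httpbin.org/get')
print(response.status_code)  # numeric status, e.g. 200
print(response.headers)      # response headers (dict-like)
print(response.cookies)      # cookies set by the server
print(response.url)          # final URL after any redirects
print(response.history)      # responses from intermediate redirects
```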
Status code analysis
- The requests library has built-in names for status codes, so you can quickly check whether a request succeeded without hard-coding numbers. For example, `response.status_code == requests.codes.ok` is equivalent to comparing against 200
```python
import requests

response = requests.get('https://httpbin.org/get')
# requests.codes.ok is the named constant for 200
if response.status_code == requests.codes.ok:
    print('Request successful')
```
Proxy settings
```python
import requests

# placeholder proxy address
proxy = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
}
response = requests.get('https://httpbin.org/get', proxies=proxy)
```
If using a SOCKS proxy (e.g., shadowsocks), you need to install an extra dependency first.
```bash
pip install 'requests[socks]'
```
```python
import requests

# placeholder SOCKS5 proxy address
proxy = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742',
}
response = requests.get('https://httpbin.org/get', proxies=proxy)
```
Timeout settings
Can be combined with a `try` / `except` block for exception handling.
```python
import requests
from requests.exceptions import Timeout

try:
    # raise an exception if no reply arrives within 1 second
    response = requests.get('https://httpbin.org/get', timeout=1)
except Timeout:
    print('Request timed out')
```
Selenium Section
For content obtained through JavaScript and rendered onto the page, some elements cannot be found by analyzing the page requests alone. In that case, Selenium can drive the browser to perform the operations. Although this is less efficient, it is very suitable for someone like me who is not familiar with web technologies.
Element Location
To simulate webpage operations, you first need to locate the element to operate on. Selenium provides multiple ways to locate elements: `find_element` returns the first matching element, while `find_elements` returns a list of all matching elements.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('https://example.com')  # placeholder URL

first_link = browser.find_element(By.TAG_NAME, 'a')   # first match only
all_links = browser.find_elements(By.TAG_NAME, 'a')   # list of every match
```
iframe Location
A webpage may be divided into several parts, with a large frame containing one or more `iframe`s. While positioned in the main frame, it is impossible to search for or locate content within the inner frames, and thus impossible to operate on them. Therefore, you need to switch frames before performing those operations.
```python
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://example.com')  # placeholder URL

# switch into the iframe by its name or id (placeholder value)
browser.switch_to.frame('inner-frame')
# ... locate and operate on elements inside the iframe ...

# switch back to the main frame when done
browser.switch_to.default_content()
```
Pop-up Window Operation
Sometimes an operation pops up an alert box, and you must switch to that window and handle it before continuing with subsequent steps.
```python
# Switch to the alert window
alert = browser.switch_to.alert
print(alert.text)  # the message shown in the alert
alert.accept()     # confirm; use alert.dismiss() to cancel instead
```
Browser Settings Change
Sometimes the browser needs special settings to complete the required operation. For example, I once needed to obtain data by clicking a download button, but normally clicking it would make the browser pop up a download dialog.
```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
# save downloads to a fixed directory without showing the dialog
options.set_preference('browser.download.folderList', 2)  # 2 = use a custom dir
options.set_preference('browser.download.dir', '/tmp/downloads')  # placeholder path
options.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'application/octet-stream')  # MIME type of the file

browser = webdriver.Firefox(options=options)
```
Waiting Settings
By default, Selenium operations run only after the page has finished loading, but with network issues, waiting for a full load before executing significantly hurts efficiency (Selenium is already less efficient than sending requests directly from code). You can therefore configure waits so that an operation runs as soon as a specific condition is met. For example, the code below waits for a search button to become clickable before entering the search text and clicking it.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://example.com')  # placeholder URL
wait = WebDriverWait(browser, 10)   # give up after 10 seconds

# wait until the search button can actually be clicked
button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'search-bar-btn')))
# hypothetical input class; fill in the search text, then click
browser.find_element(By.CLASS_NAME, 'search-bar-input').send_keys('query')
button.click()
```
Download File Checking
Selenium itself cannot manage downloaded files, so this has to be handled by other means. For example:
- If the download link can be parsed out of the page source, you can call another tool to download it.
- If the download link cannot be parsed out, you can use the `os` module to monitor the download directory and rename the file after it finishes.
I used the following two parts of code together at that time:
```python
import os
import time

download_dir = '/tmp/downloads'  # placeholder: same dir set in the browser options

def wait_for_download(timeout=60):
    """Poll the directory until Firefox's partial .part files disappear."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(download_dir)
        if names and not any(name.endswith('.part') for name in names):
            return True
        time.sleep(1)
    return False

# once the download completes, rename the newest file (placeholder name)
if wait_for_download():
    files = [os.path.join(download_dir, f) for f in os.listdir(download_dir)]
    newest = max(files, key=os.path.getctime)
    os.rename(newest, os.path.join(download_dir, 'data_001.csv'))
```
About the Pitfall of Simulating Clicks
Selenium can simulate clicks on buttons, links, checkboxes, etc., but the click must be performed while the element is within the visible viewport; otherwise, the click is ineffective or an exception is thrown. Therefore, in practice you need to test whether scrolling is required, and how far to scroll, to ensure the element can be clicked.
```python
# Switch out of the iframe first so the whole page can be scrolled
browser.switch_to.default_content()

# scroll the target element into view before clicking (placeholder id)
element = browser.find_element(By.ID, 'download-btn')
browser.execute_script('arguments[0].scrollIntoView();', element)
element.click()
```
Miscellaneous
- Generally, the first GET request retrieves the skeleton of the webpage, and further requests are then sent to fill in the needed content.
- If you need to obtain the corresponding information, you need to analyze the AJAX requests, or fall back to tools that render JavaScript:
  - Selenium / WebDriver
  - Splash