Sorry, your browser cannot access this site
This page requires browser support (enable) JavaScript
Learn more >

I have not systematically studied web scraping before, so I am making notes while watching this video.

Request / Response

A request is information sent by a browser or program to the server to request content for display. The response is the server’s reply to the received request.

  • Request content

    • Request method: GET / POST (most commonly used)
      • POST has additional form data compared to GET
      • Parameters in GET are directly included in the URL, while parameters in POST requests are included in the form
    • Request URL: A Uniform Resource Locator (URL) is a link to a file or object
    • Request headers: Important configuration information stored as key-value pairs
    • Request body: Generally no information for GET, but required for POST
  • Response content

    • Response status code: A numeric code used to indicate the status of the request
    • Response headers
    • Response body: The result of the request

Python requests module usage

requests is based on urllib3 and provides a more convenient and feature-rich interface

Adding request parameters

  • The params parameter can conveniently add parameters to the request, avoiding manual URL construction
1
2
3
4
5
data = {
'arg1': '1',
'arg2': '2'
}
response = requests.get('url', params=data)

JSON parsing

  • Provides a json method that directly converts the returned JSON string into a JSON object
1
2
response = requests.get('url', params=data)
response.json()

Binary data retrieval

  • Use GET to request images directly, then write them to a file in ‘wb’ mode

Adding headers

  • Mainly for successful retrieval; some websites identify User-Agent to prevent machine scraping
1
2
3
4
headers = {
'User-Agent': 'XXXX'
}
response = requests.get('url', headers=headers)

POST Request Related

Adding request parameters / headers

  • Similar to GET (see above)

Response attributes

  • Common attributes include:
    • status_code
    • headers
    • cookies
    • url
    • history

Status code analysis

  • The requests library itself categorizes status codes, so you can quickly determine if the request was successful by calling built-in information. For example, response.status_code == requests.codes.ok is equivalent to 200
1
2
response = requests.get('url')
exit() if response.status_code != requests.codes.ok else print('All Right')

Proxy settings

1
2
3
4
5
6
proxy = {
'http': 'http://127.0.0.1:9743',
'https': 'https://user:passwd@127.0.0.1:9743'
}

response = requests.get('url', proxies=proxy)

If using ss, you need to install additional plugins.

1
pip install 'requests[socks]'
1
2
3
4
5
proxy = {
'socks5': 'http://127.0.0.1:9743'
}

response = requests.get('url', proxies=proxy)

Timeout settings

Can be combined with try for exception handling.

1
response = requests.get('url', timeout=1)

Selenium Section

For content obtained through JavaScript and rendered on the page, some elements may not be found when analyzing page requests. At this time, Selenium can control the browser to perform operations. Although this is less efficient, it is very suitable for someone like me who is not familiar with web-related things.

Element Location

To simulate webpage operations, you first need to find the location of the element to be operated on. Selenium provides multiple ways to locate elements; using find_element will return the first matching element, while using find_elements will return a list of all objects.

1
2
3
4
5
6
7
8
9
10
browser = webdriver.Firefox()
browser.get('url')
browser.find_element_by_id('123')
browser.find_element_by_name('123')
browser.find_element_by_xpath('123')
browser.find_element_by_link_text('123')
browser.find_element_by_partial_link_text('123')
browser.find_element_by_tag_name('123')
browser.find_element_by_class_name('123')
browser.find_element_by_css_selector('123')

iframe Location

A webpage may be divided into several parts, with a large frame containing one or more iframes. When located within the main frame, it is impossible to search for or locate content within the small frames, and thus cannot perform corresponding operations. Therefore, you need to switch frames before performing operations.

1
2
3
4
5
6
7
browser = webdriver.Firefox()
browser.get('url')
# Switch to iframe by name
browser.switch_to.frame('analyzeFrame')
# Switch back to main frame
browser.switch_to.default_content()
# Untested if switching between iframes within an iframe is possible

Pop-up Window Operation

Sometimes certain operations will pop up a warning box, requiring you to switch to this window for corresponding operations before continuing subsequent steps.

1
2
3
4
# Switch to the alert window
al = browser.switch_to_alert()
# Click on the accept button in the window
al.accept()

Browser Settings Change

Sometimes special settings need to be made to the browser to complete the required operation. For example, I once needed to obtain data by clicking a download button, but normally when clicking the download button, the browser would pop up a download window.

1
2
3
4
5
6
7
8
options = Options()
# The first item is the setting name, and the second is the corresponding value; available settings can be viewed by entering 'about:config' in the browser.
options.set_preference("browser.download.folderList", 2)
browser = webdriver.Firefox(firefox_options=options)
# The following sets up socks proxy
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=socks5://localhost:1080')
browser = webdriver.Chrome(chrome_options=chrome_options)

Waiting Settings

All Selenium operations are performed after the page has finished loading, but due to network issues, if you wait until it is fully loaded before executing, it significantly affects efficiency (Selenium is already less efficient than purely sending requests via code). Therefore, some settings can be made to execute related operations immediately when a specific condition is met. For example, below waits for a specific search button to become clickable before entering the search content and clicking the search.

1
2
3
button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'search-bar-btn')))
browser.find_element_by_class_name('search-input').send_keys(tar)
button.click()

Download File Checking

Selenium itself cannot manage download files, so it can only be done through other means. For example:

  • If the download link can be parsed out from the webpage code, you can call other tools to download.
  • If the download link cannot be parsed out, you can use os module related content to monitor the downloaded file and rename it after downloading.

I used the following two parts of code together at that time:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import os
# Copied code; this function checks the file size every 5 seconds. If the size has not changed since the last check, it is determined that the download is complete, and the browser object will be closed.
def wait_download(file_path, browser, rep_name):
current_size = getSize(file_path)
print("{} Downloading".format(str(current_size)))
while current_size != getSize(file_path) or getSize(file_path)==0:
current_size =getSize(file_path)
print("current_size:"+str(current_size))
time.sleep(5)# wait download
print("Downloaded")
os.rename(file_path, rep_name)
browser.close()

# Because the download starts itself takes time, the target file appears after starting the above function.
while not os.path.exists(file_path):
print('waiting for download to start...')
time.sleep(5)
wait_download(file_path, browser, rep_name)

About the Pitfall of Simulating Clicks

Selenium can simulate clicks on buttons, links, checkboxes, etc., but the click must be performed when these elements are within the visible range; otherwise, the click will be ineffective or an exception will be thrown. Therefore, in actual operation, you need to test whether scrolling is required and how much to scroll to ensure that the element can be clicked.

1
2
3
4
5
6
# Switch out of iframe
browser.switch_to.default_content()
# Scroll browser to bottom
browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
# Switch back to iframe
browser.switch_to.frame('analyzeFrame')

Miscellaneous

  • Generally, the first GET request gets the framework of the webpage, and then new requests are sent to fill in the content needed.
  • If you need to obtain corresponding information, you need to analyze AJAX requests.
    • Selenium/WebDriver
    • Splash

Comments

Please leave your comments here