Selenium Crawl API

Common Selenium methods

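When driving a page with Selenium, always imitate the way a person would operate it.
- For example, decide how many seconds an action such as logging in should take, ideally down to each individual step.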
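
One way to pace individual steps is a short randomized pause between actions. The sketch below only illustrates the idea; the helper name `human_pause` and its default bounds are assumptions, not part of Selenium.

```python
import random
import time


def human_pause(min_s=0.8, max_s=2.5, sleep=time.sleep):
    """Sleep for a random duration in [min_s, max_s] and return it.

    The `sleep` argument is injectable so the helper can be exercised
    without actually waiting.
    """
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

A login flow could call `human_pause()` after typing the user name, again after typing the password, and once more before clicking the submit button.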

Initialization

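# Selenium 4 style: pass the options object via `options=` (the older `chrome_options=` keyword is deprecated)
self.option = webdriver.ChromeOptions()
# Drop the "enable-automation" switch that ChromeDriver adds by default
self.option.add_experimental_option('excludeSwitches', ['enable-automation'])
self.driver = webdriver.Chrome(options=self.option)
# self.driver.maximize_window()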

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()  # container for Chrome launch options
chrome_options.add_argument('lang=zh_CN.UTF-8')  # set the language/encoding
# Emulate a mobile device (mobile sites tend to have weaker anti-crawling measures); here an iPhone 6
chrome_options.add_argument('--user-agent=Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1')

chrome_options.add_argument('--no-sandbox')  # disable the sandbox; works around the "DevToolsActivePort file doesn't exist" error
chrome_options.add_argument('--disable-gpu')  # the Google documentation mentions this flag as a bug workaround
chrome_options.add_argument('--disable-dev-shm-usage')  # work around limited /dev/shm resources (Linux only)
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, useful for some special pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
chrome_options.add_argument('--headless')  # no visible window; on Linux without a display, startup fails without this

proxy = '222.221.11.119:3128'
chrome_options.add_argument('--proxy-server=http://' + proxy)  # route traffic through a proxy
# Raw string so the backslashes in the Windows path are not treated as escapes
service = Service(executable_path=r'C:\Users\Administrator\Anaconda3\Scripts\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.get('http://www.baidu.com')
# print(driver.page_source)  # print the page source
print(len(driver.page_source))  # 744016 in the original run
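
The switches above can also be assembled by a small function, which keeps optional settings (proxy, headless mode, User-Agent) in one place. This is a sketch under assumptions: `build_chrome_args` and `make_driver` are hypothetical helper names, and `make_driver` needs Selenium 4 plus a local Chrome installation to run.

```python
def build_chrome_args(headless=True, proxy=None, user_agent=None, load_images=False):
    """Return a list of Chrome command-line switches (plain strings)."""
    args = [
        "--no-sandbox",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--hide-scrollbars",
    ]
    if headless:
        args.append("--headless")
    if not load_images:
        args.append("--blink-settings=imagesEnabled=false")
    if proxy:
        args.append("--proxy-server=http://" + proxy)
    if user_agent:
        args.append("--user-agent=" + user_agent)
    return args


def make_driver(args):
    """Create a Chrome driver from a list of switches."""
    # Imported lazily so build_chrome_args can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in args:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

For example, `make_driver(build_chrome_args(proxy="222.221.11.119:3128"))` would start a headless, image-free browser behind the proxy used earlier.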

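Common operations

driver.page_source the rendered HTML source of the current tab
driver.current_url the URL of the current tab
driver.close() close the current tab; if it is the only tab, the whole browser closes
driver.quit() close the browser
driver.forward() go forward one page
driver.back() go back one page
driver.save_screenshot(img_name) take a screenshot of the page
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);") scroll to the bottom
element.send_keys() type into an element (a WebElement method, not a driver method)
element.clear() clear an element's content (also a WebElement method)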

Chrome_Options vs. option

Chrome_Options

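# Selenium takes its launch configuration from a ChromeOptions instance.
# The variable name (option, options, chrome_options) is arbitrary, but it must be used consistently.
from selenium import webdriver
options = webdriver.ChromeOptions()

# Set the User-Agent
options.add_argument('--user-agent=MQQbrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1')

# Set the browser window size (width,height)
options.add_argument('--window-size=1920,3000')

# The Google documentation mentions this flag as a bug workaround
options.add_argument('--disable-gpu')

# Hide scrollbars, useful for some special pages
options.add_argument('--hide-scrollbars')

# Do not load images, for speed
options.add_argument('blink-settings=imagesEnabled=false')

# No visible window; on Linux without a display, startup fails without this
options.add_argument('--headless')

# Disable the sandbox (the source describes this as "run with the highest privileges")
options.add_argument('--no-sandbox')

# Manually specify the browser binary to use
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"

# Add a .crx extension (raw string so the backslashes are kept literally)
options.add_extension(r'd:\crx\AdBlock_v2.17.crx')

# Disable JavaScript
options.add_argument("--disable-javascript")

# Exclude the enable-automation switch (the source calls this "developer mode");
# according to the source, the webdriver property then reports a normal value
options.add_experimental_option('excludeSwitches', ['enable-automation'])

# Disable browser notification pop-ups
prefs = {
    'profile.default_content_setting_values': {
        'notifications': 2
    }
}
options.add_experimental_option('prefs', prefs)


driver = webdriver.Chrome(options=options)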

Other configuration flags

--user-data-dir="[PATH]"
# Path of the User Data folder; lets you keep user data such as bookmarks on a partition other than the system one

--disk-cache-dir="[PATH]"
# Path of the cache directory

--disk-cache-size=
# Maximum cache size, in bytes

--first-run
# Reset to the initial state, as if running for the first time

--incognito
# Start in incognito mode

--disable-javascript
# Disable JavaScript; add this if pages feel slow

--omnibox-popup-count="num"
# Change the number of suggestions shown in the address-bar popup to num

--user-agent="xxxxxxxx"
# Override the User-Agent string in HTTP request headers; the effect can be checked on the about:version page

--disable-plugins
# Do not load any plugins, which can improve speed; the effect can be checked on the about:plugins page

--disable-java
# Disable Java

--start-maximized
# Start maximized

--no-sandbox
# Disable the sandbox

--single-process
# Run in a single process

--process-per-tab
# Use a separate process for each tab

--process-per-site
# Use a separate process for each site

--in-process-plugins
# Do not run plugins in a separate process

--disable-popup-blocking
# Disable pop-up blocking

--disable-images
# Disable images

--enable-udd-profiles
# Enable the account-switching menu

--proxy-pac-url
# Use a PAC proxy

--lang=zh-CN
# Set the language to Simplified Chinese

--media-cache-size
# Maximum media cache size, in bytes

--bookmark-menu
# Add a bookmarks button to the toolbar

--enable-sync
# Enable bookmark sync
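
Flags copied from web pages often arrive with typographic dashes, spaces, or stray capitals, and Chrome may not recognize them in that form. The sketch below normalizes a name into the `--name` or `--name=value` shape; `to_switch` is a hypothetical helper, not a Selenium or Chrome API.

```python
def to_switch(name, value=None):
    """Format a Chrome switch as `--name` or `--name=value`.

    Leading hyphens, en dashes (U+2013) and em dashes (U+2014) are
    stripped, inner spaces become hyphens, and the name is lowercased.
    The value, if given, is left untouched.
    """
    cleaned = name.strip().lstrip("-\u2013\u2014").lower().replace(" ", "-")
    if not cleaned:
        raise ValueError("empty switch name")
    if value is None:
        return "--" + cleaned
    return "--{}={}".format(cleaned, value)
```

The result can be passed straight to `options.add_argument(...)`, for example `options.add_argument(to_switch("lang", "zh-CN"))`.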

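Switching tabs

1. Get the list of handles for all currently open tabs
current_windows = driver.window_handles

2. Switch by index into that handle list
driver.switch_to.window(current_windows[0])

3. Use switch_to to enter a frame

To work inside an iframe, use the following methods:

# First switch into the inner page (by default the frame's id or name attribute is used)
driver.switch_to_frame()     # old form, deprecated and not recommended
driver.switch_to.frame()     # accepts an index, a name/id, or a frame element; switches into the iframe
driver.switch_to.default_content()   # switch back to the main page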
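
Forgetting to switch back out of a frame is an easy mistake, so one option is to wrap the pair of calls in a context manager. This is a minimal sketch: `in_frame` is a hypothetical helper, and the `_FakeDriver` classes are stand-ins that only record calls so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def in_frame(driver, frame_ref):
    """Switch into a frame for the duration of a `with` block.

    `frame_ref` is whatever driver.switch_to.frame() accepts:
    an index, a name/id string, or a frame element.
    """
    driver.switch_to.frame(frame_ref)
    try:
        yield driver
    finally:
        # Always return to the top-level document, even if an error occurred.
        driver.switch_to.default_content()


class _FakeSwitchTo:
    """Stand-in for driver.switch_to that records calls (demo only)."""

    def __init__(self):
        self.calls = []

    def frame(self, ref):
        self.calls.append(("frame", ref))

    def default_content(self):
        self.calls.append(("default_content",))


class _FakeDriver:
    """Minimal stand-in for a WebDriver (demo only)."""

    def __init__(self):
        self.switch_to = _FakeSwitchTo()


demo = _FakeDriver()
with in_frame(demo, "ptlogin_iframe"):
    pass  # interact with elements inside the frame here
print(demo.switch_to.calls)
```

Note that `default_content()` jumps to the top-level page, so with nested iframes the outer frame has to be entered again afterwards.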

Window handles

Current URL: driver.current_url
Handle of the current window: driver.current_window_handle
Handles of all windows: driver.window_handles
Switch windows: driver.switch_to.window()
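
Because the order of `driver.window_handles` is not guaranteed to match the order in which windows were opened, a common approach is to record the handles before a click and compare afterwards. The following is a sketch; `new_handles` and `switch_to_new_window` are hypothetical helper names.

```python
def new_handles(before, after):
    """Return the handles present in `after` but not in `before`, in order."""
    seen = set(before)
    return [h for h in after if h not in seen]


def switch_to_new_window(driver, before):
    """Switch to the first window opened since `before` was captured.

    Returns the new handle, or None when no new window has appeared yet.
    """
    opened = new_handles(before, driver.window_handles)
    if not opened:
        return None
    driver.switch_to.window(opened[0])
    return opened[0]
```

Typical use would be `before = driver.window_handles`, then the click that opens a tab, then `switch_to_new_window(driver, before)`.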

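Special case: some pop-ups are fake (not real windows), so take the XPath from the window that actually contains the element
Special case: an iframe nested inside another iframe (the outer iframe's name and class change dynamically, so it has to be located with an absolute XPath)

# requires: from selenium.webdriver.common.by import By
self.driver.switch_to.frame(self.driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/div[4]/div[2]/div[2]/iframe"))
self.driver.switch_to.frame("ptlogin_iframe")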

Closing the other browser windows

    def clear_window(self):
        """Close every window except the current one."""
        # Remember the current window
        now_handle = self.browser.current_window_handle
        for handle in self.browser.window_handles:
            if handle == now_handle:
                continue
            self.browser.switch_to.window(handle)
            self.browser.close()
        # Return focus to the original window
        self.browser.switch_to.window(now_handle)

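Automatic scroll-down

    def scroll_to_bottom(self, driver):
        '''Scroll down step by step until the page stops growing.'''
        # requires: import time
        js = "return document.body.scrollHeight"
        # Scroll position reached so far
        height = 0
        # Total height of the current page
        new_height = driver.execute_script(js)
        while height < new_height:
            # Move toward the bottom in 100 px steps, pausing like a human reader
            for y in range(height, new_height, 100):
                driver.execute_script('window.scrollTo(0, {})'.format(y))
                time.sleep(1.5)
            height = new_height
            # Lazy-loaded content may have made the page taller; re-read the height
            # (on endlessly loading pages, add a cap on the number of passes)
            new_height = driver.execute_script(js)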
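
On pages that keep loading content, a scroll loop is safer with a cap on the number of passes. The sketch below separates the arithmetic (which offsets to visit) from the browser calls; `scroll_offsets`, the module-level `scroll_to_bottom`, and the `max_rounds` parameter are assumptions made for illustration.

```python
import time


def scroll_offsets(start, total_height, step=100):
    """Return the vertical offsets to visit between start and total_height."""
    if step <= 0:
        raise ValueError("step must be positive")
    offsets = list(range(start, total_height, step))
    # Finish exactly at the bottom so the last partial step is not skipped.
    if total_height > start and (not offsets or offsets[-1] != total_height):
        offsets.append(total_height)
    return offsets


def scroll_to_bottom(driver, step=100, pause=0.5, max_rounds=20):
    """Scroll down gradually, re-reading the page height after each pass."""
    get_height = "return document.body.scrollHeight"
    position = 0
    for _ in range(max_rounds):
        total = driver.execute_script(get_height)
        if position >= total:
            break  # nothing new was loaded, so the bottom has been reached
        for y in scroll_offsets(position, total, step):
            driver.execute_script("window.scrollTo(0, arguments[0]);", y)
            time.sleep(pause)
        position = total
    return position
```

Passing the offset through `arguments[0]` avoids building JavaScript by string formatting.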

Uploading images

Method 1: drive the native Windows file dialog [general-purpose]

    # requires: import random, time, win32con, win32gui  (pywin32)
    def UploadPicture(self, filePath, browser_type="Chrome"):
        """
        Fill in and confirm the native file-selection dialog.
        """
        try:
            # Dialog titles as shown on Chinese-language Windows
            if browser_type == "Chrome":
                title = "打开"  # "Open" (Chrome)
            else:
                title = "文件上传"  # "File Upload" (Firefox)
            dialog = win32gui.FindWindow("#32770", title)  # level 1: the "Open" dialog
            ComboBoxEx32 = win32gui.FindWindowEx(dialog, 0, "ComboBoxEx32", None)  # level 2
            ComboBox = win32gui.FindWindowEx(ComboBoxEx32, 0, "ComboBox", None)  # level 3
            edit = win32gui.FindWindowEx(ComboBox, 0, "Edit", None)  # level 4: file-name edit box
            button = win32gui.FindWindowEx(dialog, 0, "Button", None)  # level 2: the Open button
            # Type the file path into the file-name edit box
            win32gui.SendMessage(edit, win32con.WM_SETTEXT, None, filePath)
            # Click the Open button
            win32gui.SendMessage(dialog, win32con.WM_COMMAND, 1, button)
            time.sleep(random.uniform(3, 5))
            return True
        except Exception as error:
            print('uploadPicture error', error)
            return False

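Method 2: send the file path to the input element next to the upload button

# requires: from selenium.webdriver.common.by import By
self.driver.find_element(By.XPATH, '''//input[@data-text="true"]''').send_keys(self.image)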

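Mouse actions

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

dr = webdriver.Chrome()
dr.get('http://www.baidu.com')
# move_by_offset is relative to the current pointer position, not an absolute coordinate
ActionChains(dr).move_by_offset(200, 100).click().perform()  # left-click after moving 200 px right and 100 px down
ActionChains(dr).move_by_offset(200, 100).context_click().perform()  # right-click after moving a further (200, 100)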
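
Since `move_by_offset` moves relative to the current pointer position, clicking a series of absolute coordinates requires converting them into successive offsets. The sketch below assumes the pointer starts at the viewport origin and that its position carries over between action chains; `relative_moves` and `click_points` are hypothetical helpers.

```python
def relative_moves(points, start=(0, 0)):
    """Convert absolute (x, y) targets into successive (dx, dy) offsets."""
    moves = []
    cur_x, cur_y = start
    for x, y in points:
        moves.append((x - cur_x, y - cur_y))
        cur_x, cur_y = x, y
    return moves


def click_points(driver, points):
    """Left-click each absolute viewport coordinate in turn (needs Selenium)."""
    # Imported lazily so relative_moves can be used without Selenium installed.
    from selenium.webdriver.common.action_chains import ActionChains

    for dx, dy in relative_moves(points):
        ActionChains(driver).move_by_offset(dx, dy).click().perform()
```

For instance, clicking (200, 100) twice yields the offsets (200, 100) and then (0, 0), so the second click lands on the same spot instead of drifting.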

Keyboard actions

# Import
from selenium.webdriver.common.keys import Keys
# Keys.BACK_SPACE: Backspace
# Keys.TAB: Tab
# Keys.ENTER: Enter
# Keys.SHIFT: Shift
# Keys.CONTROL: Control (Ctrl)
# Keys.ALT: Alt
# Keys.ESCAPE: Escape (Esc)
# Keys.SPACE: Space
# Keys.PAGE_UP: Page Up
# Keys.PAGE_DOWN: Page Down
# Keys.END: End (end of line)
# Keys.HOME: Home (start of line)
# Keys.LEFT: Left arrow
# Keys.UP: Up arrow
# Keys.RIGHT: Right arrow
# Keys.DOWN: Down arrow
# Keys.INSERT: Insert
# Keys.DELETE: Delete
# Keys.NUMPAD0 ~ Keys.NUMPAD9: numeric keypad 0-9
# Keys.F1 ~ Keys.F12: F1-F12
# (Keys.CONTROL, 'a'): Ctrl+A, select all
# (Keys.CONTROL, 'c'): Ctrl+C, copy
# (Keys.CONTROL, 'x'): Ctrl+X, cut
# (Keys.CONTROL, 'v'): Ctrl+V, paste

Wait until

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

locator = (By.XPATH, "xxxxxxx")
d = webdriver.Chrome()
d.get("http://www.sina.com")
# Poll every 1 s, for up to 10 s, until the element is present in the DOM
WebDriverWait(d, 10, 1).until(EC.presence_of_element_located(locator))
print("XXX")
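
The three arguments to `WebDriverWait` are the driver, the timeout in seconds, and the polling interval. To make the polling idea concrete, here is a simplified stand-alone loop; it is an illustration only, not Selenium's implementation, and `wait_until` and `WaitTimeout` are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=10.0, poll=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        value = condition()
        if value:
            return value
        if clock() >= deadline:
            raise WaitTimeout("condition not met within {} s".format(timeout))
        sleep(poll)


# Demo: a condition that succeeds on its third call.
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return "found" if len(attempts) >= 3 else None


result = wait_until(ready_on_third_call, timeout=5, poll=0.01)
print(result, len(attempts))
```

The same pattern explains why an explicit wait returns as soon as the element appears rather than always sleeping for the full timeout.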

Cropped screenshot (two-step capture)

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from PIL import Image

driver = webdriver.Chrome()
driver.get('https://www.baidu.com/')
time.sleep(3)

# Demo 1: screenshot of the whole page
# driver.save_screenshot('screenshot.png')
# driver.quit()

# Demo 2: screenshot of a located block
driver.save_screenshot(r'photo.png')  # first capture: the full image
baidu = driver.find_element(By.ID, 'su')  # target: the "Baidu Search" button
# baidu = driver.find_element(By.XPATH, "//div[@id='lg']/img[@class='index-logo-src']")  # target: the Baidu logo image
# print(baidu)
left = baidu.location['x']  # x coordinate of the block's top-left corner in the page
top = baidu.location['y']  # y coordinate of the block's top-left corner in the page
right = left + baidu.size['width']  # x coordinate of the block's bottom-right corner
bottom = top + baidu.size['height']  # y coordinate of the block's bottom-right corner
# print({"left": left, "top": top, "right": right, "bottom": bottom})
# print("baidu.size['width']:%s" % baidu.size['width'])
# print("baidu.size['height']:%s" % baidu.size['height'])
picture = Image.open(r'photo.png')
picture = picture.crop((left, top, right, bottom))  # second step: crop the block out of the full image
picture.save(r'photo2.png')
driver.quit()
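
The crop box is plain arithmetic on the element's `location` and `size` dictionaries. On high-DPI displays the screenshot can contain more pixels than the page has CSS pixels, in which case the box needs scaling. The sketch below makes that explicit; `crop_box` and its `scale` parameter are assumptions for illustration (the scale could, for example, be read from `window.devicePixelRatio`).

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location: dict with 'x' and 'y' (as in element.location)
    size:     dict with 'width' and 'height' (as in element.size)
    scale:    screenshot pixels per CSS pixel (assumed to be 1.0 by default)
    """
    left = location["x"] * scale
    top = location["y"] * scale
    right = left + size["width"] * scale
    bottom = top + size["height"] * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

The returned tuple can be passed directly to `Image.crop(...)`.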

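`XPath parsing`

Fuzzy matching
- //ul/li[text()="经验教程"] -- exact text match
- self.driver.find_element(By.LINK_TEXT, ...) -- match a link by its full text (find_element_by_link_text() in Selenium 3)
- //a[contains(text(), "搜索")] -- partial text match
- //li[contains(., '单次预约')] -- partial match against the element's whole string value
Common parsing
- Extract text: post_time = selector.xpath('//div[@class="video-data"]/span/text()')
- Result: ['295播放\xa0·\xa0', '0弹幕', '2020-12-05 09:00:01']
- Basic extraction (Scrapy/parsel selector methods)
  
  extract() returns a list of strings; use it together with [0]
  
  extract_first() returns the first string in the list, or None if the list is empty
  
  get() returns the first text item in the list (equivalent to extract_first())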
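
Results of `text()` often carry non-breaking spaces (`\xa0`) and separator characters, as in the sample output above, so a small cleanup step is usually needed before storing them. This is a sketch with hypothetical helper names (`clean_fragments`, `first_or_none`); the second one mirrors the "first item or None" behaviour described for extract_first().

```python
def clean_fragments(fragments):
    """Strip non-breaking spaces and separator dots from text() results."""
    cleaned = []
    for text in fragments:
        text = text.replace("\xa0", " ")
        text = text.strip(" \u00b7\t\r\n")  # \u00b7 is the middle dot separator
        if text:
            cleaned.append(text)
    return cleaned


def first_or_none(items):
    """Return the first item, or None for an empty list."""
    return items[0] if items else None


raw = ["295播放\xa0\u00b7\xa0", "0弹幕", "2020-12-05 09:00:01"]
print(clean_fragments(raw))
```

With the sample data, the play count, the comment count, and the timestamp come back as three clean strings.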

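`XPath pitfalls`

Element cannot be clicked (error: the element is not clickable) [The page is rendered client-side; study the JS code and reproduce the equivalent operation, or use Firefox to inspect the JS click handlers]
- Example: the image-upload button for micro-posts on Toutiaohao
Message: element not interactable [1. Wait until the element has fully loaded. 2. Try a different XPath (the one copied from Chrome)]
- Example: on Toutiaohao's "write an answer" page the element could not be located for send_keys
The target link is loaded asynchronously on the client: hovering over the element shows no link (the address cannot be copied and it cannot be opened in a new tab); only clicking navigates, replacing the current page
Element not interactable [drop-down list]
- Click the drop-down arrow first so the options are rendered, then locate them directly by XPath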

Chrome detection flags

console: navigator (inspect the navigator object, e.g. navigator.webdriver, in the DevTools console)

Selenium keyboard actions

Key	Description
send_keys(Keys.BACK_SPACE)	Backspace (delete)
send_keys(Keys.SPACE)	Space
send_keys(Keys.TAB)	Tab
send_keys(Keys.ESCAPE)	Escape (Esc)
send_keys(Keys.ENTER)	Enter
send_keys(Keys.CONTROL,'a')	Select all, Ctrl+A
send_keys(Keys.CONTROL,'c')	Copy, Ctrl+C
send_keys(Keys.CONTROL,'x')	Cut, Ctrl+X
send_keys(Keys.CONTROL,'v')	Paste, Ctrl+V
send_keys(Keys.F1)	F1
send_keys(Keys.F12)	F12

Selenium caveats

When switching pages, always wait a few extra seconds so the page can finish loading
