
Crack the parameters? Spare your hairline — Selenium to the rescue!


typographicposters.com, a poster-appreciation site from abroad, passes its data as JSON in a rather interesting (hair-pulling) way, and its categories are interesting too: you click an RGB color value to filter, so posters are grouped by color. There's a fair amount of data, and fetching it directly with requests simply doesn't work.

Target site

Target URL

https://www.typographicposters.com/?filter=recent
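That "plain requests won't cut it" claim is easy to check. A minimal sketch (the 'posters-item' class name comes from the XPath used later in this post; everything else is just a plain page fetch):

import requests

# Fetch the listing page directly, without a browser.
r = requests.get(
    'https://www.typographicposters.com/?filter=recent',
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10,
)
print(r.status_code)
# If the poster grid really is injected client-side, the static HTML
# should not contain the listing items the scraper needs:
print('posters-item' in r.text)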

 

Loading the next page of data requires a manual click — there is a "view more" button.

The next-page data appears to be fetched by a script (an AJAX/jQuery-style request).
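Since that button drives the pagination, an explicit wait is sturdier than fixed sleeps when clicking it. A sketch using Selenium's WebDriverWait (the XPath is the same one used later in this post):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()  # assumes chromedriver is on PATH
browser.get('https://www.typographicposters.com/?filter=recent')

# Wait up to 10 seconds for the "view more" button to become clickable.
button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable(
        (By.XPATH, '//div[@class="pagination"]/button[@class="button-highlight"]')
    )
)
button.click()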

 

Packet capture

Capture 1: the request headers — the red-underlined spot throws an error

Even with every request header added, the same red-underlined error comes back. A server error?
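For reference, the replay attempt looks roughly like the sketch below. The endpoint, payload, and header values are hypothetical stand-ins — the post does not show the actual captured request:

import requests

# Every header copied over from the browser capture (values are placeholders).
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.typographicposters.com/',
    'Content-Type': 'application/json',
    # ...the rest of the captured headers...
}
payload = {'filter': 'recent', 'page': 2}  # hypothetical JSON body

r = requests.post(
    'https://www.typographicposters.com/api/posters',  # hypothetical endpoint
    json=payload, headers=headers, timeout=10,
)
print(r.status_code, r.text[:200])  # in the capture above, this still errors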

 

Scared off — time to fall back on the clumsiest method: Python + Selenium!

 

Step 1: grab the full page source

1. Not confident with while-loop logic yet, so a for loop it is (a while-loop sketch follows this list)!

2. Errors are guaranteed!!

3. Grab the next-page button with XPath:

find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]')
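As promised, a while-loop sketch: keep clicking "view more" until the button disappears. It relies on find_elements_by_xpath returning an empty list (rather than raising) when nothing matches:

import time
from selenium import webdriver

browser = webdriver.Chrome()  # assumes chromedriver is on PATH
browser.get('https://www.typographicposters.com/?filter=recent')
time.sleep(5)

pages = 0
while True:
    buttons = browser.find_elements_by_xpath(
        '//div[@class="pagination"]/button[@class="button-highlight"]')
    if not buttons:  # empty list -> no more pages to load
        break
    buttons[0].click()
    time.sleep(3)  # crude wait for the next batch of posters to render
    pages += 1
print('pages loaded:', pages)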

 

Source code:

# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
# -*- coding: UTF-8 -*-
import time
from selenium import webdriver

def xl(browser, i):
    """Click "view more", scroll down, and save a snapshot of the page source."""
    try:
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    except Exception:
        print("Network problem! Retrying...")
        time.sleep(5)
        browser.find_element_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]').click()
    time.sleep(3)
    js = "var q=document.documentElement.scrollTop=100000"  # scroll to the bottom
    browser.execute_script(js)
    html = browser.page_source
    with open('sj{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(html)


chromedriver_path = r"C:\Users\Administrator\AppData\Local\Programs\Python\Python37\chromedriver.exe"  # full path
url = 'https://www.typographicposters.com/?filter=recent'
options = webdriver.ChromeOptions()  # configure Chrome launch options
# options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip image loading to speed things up
options.add_experimental_option("excludeSwitches", ['enable-automation'])  # important: hides the automation flag so sites are less likely to detect Selenium

browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)

browser.get(url)
time.sleep(5)
js = "var q=document.documentElement.scrollTop=100000"  # scroll to trigger lazy loading
browser.execute_script(js)
time.sleep(2)
for i in range(1, 20):
    print(i)
    # find_element_by_xpath raises when the button is gone, so use
    # find_elements_by_xpath and test the (possibly empty) list instead
    if browser.find_elements_by_xpath('//div[@class="pagination"]/button[@class="button-highlight"]'):
        xl(browser, i)
    else:
        print("Paging finished")
        time.sleep(10)
        # save the final page source
        html = browser.page_source
        with open('sjj.html', 'w', encoding='utf-8') as f:
            f.write(html)
        break






 

Step 2: collect the detail-page links, then the full-size image links, and download the images

1. Use XPath to pull each detail-page link from the listing snapshot.

2. Use a regex for the image link — it turns out the page <head> already carries the full-size image address, so fetching the source with requests and applying one regex gets the image URL in a single step (a quick demo follows this list).
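A quick demo of that one-regex step, run against a made-up sample of the <head> markup (real pages may differ slightly; the regex is the same one the scraper below uses):

import re

sample = '<meta property="og:image:url" content="https://www.typographicposters.com/example.jpg" />'
img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', sample, re.S)[0]
print(img_url)  # https://www.typographicposters.com/example.jpg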

 

Source code:

# https://www.typographicposters.com
# Poster image scraper
# 2020-05-13 by WeChat: huguo00289
# -*- coding: UTF-8 -*-
import requests, re, time
from lxml import etree
from fake_useragent import UserAgent
from selenium import webdriver

def ua():
    """Build request headers with a random User-Agent."""
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers

def tp(img_url):
    """Download one image and save it under its original file name."""
    ua = UserAgent()
    headers = {
        'referer': 'https://www.typographicposters.com',
        'User-Agent': ua.random,
    }
    img_name = img_url.split('/')[-1]
    r = requests.get(img_url, headers=headers, timeout=10)
    time.sleep(1)
    with open(img_name, 'wb') as f:
        f.write(r.content)
    print(f"{img_name} downloaded successfully")


def get_img(url):
    """Unused Selenium alternative: scrape image srcsets straight from a rendered page."""
    chromedriver_path = r"C:\Users\Administrator\AppData\Local\Programs\Python\Python37\chromedriver.exe"  # full path
    options = webdriver.ChromeOptions()  # configure Chrome launch options
    # options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip image loading to speed things up
    options.add_experimental_option("excludeSwitches", ['enable-automation'])  # important: hides the automation flag so sites are less likely to detect Selenium

    browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)

    browser.get(url)
    time.sleep(5)
    html = browser.page_source
    imgs = re.findall(r'<picture><source type="image/webp" srcset=".*?"><img srcset="(.+?)"></picture>', html, re.S)
    for img in imgs:
        img = img.split(',')[-1]  # keep the largest srcset candidate
        print(img)

    browser.close()

def getimg(url):
    """Fetch a detail page with requests and download its full-size image."""
    html = requests.get(url, headers=ua(), timeout=10).content.decode('utf-8')
    img_url = re.findall(r'<meta property="og:image:url" content="(.+?)" />', html, re.S)[0]
    print(img_url)
    tp(img_url)


# Parse one of the page snapshots saved in step 1.
with open("sj17.html", encoding='utf-8') as f:
    html = f.read()
req = etree.HTML(html)
hrefs = req.xpath('//div[@class="col-6 posters-item"]/a/@href')
print(len(hrefs))
for href in hrefs:
    url = f"https://www.typographicposters.com{href}"
    print(url)
    try:
        getimg(url)
    except Exception:
        print(f"Failed: {url}")

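Note that the script above only parses sj17.html, a single snapshot saved in step 1; to cover every poster, loop over all of the saved sj*.html files.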

 

This was the lazy way out — cracking the POST parameters would still be the proper fix!

 
