Python Crawler: Zhihu Q&A Crawler Demo for Scraping Text and Images (No Cookies Required)

Python Crawler · by 追逐 · 120 views · 0 comments

"Thanks for the invite. I'm in the US, just got off the plane!"

The line above is a meme that almost every Zhihu regular knows by heart. At some point Zhihu morphed into a giant collection of jokes, spam, and bait threads, with quality nowhere near what it used to be. Add the "river crab" censorship beast on the prowl, with posts inexplicably deleted and accounts banned, and Zhihu just isn't the Zhihu it once was!

 

"Bihu: share the story you just made up." Naturally it's also full of the bait threads that LSPs love. Begging for them to be true. Manual doge for self-preservation!!

 

Target link to scrape: https://www.zhihu.com/question/328457531

 

Here, this humble author takes one of those bait threads as the example for this Zhihu Q&A Python crawler demo, scraping text and images without cookies. No login is needed to fetch the answer data; all you need is the question link or its id.

An LSP favorite!!!

The Zhihu question can be supplied in any of the following three forms, all handled below:

Reference code:

# -*- coding: UTF-8 -*-
# Get the Zhihu question id
# 20201208 @author:WX:huguo00289
# @WeChat official account: 二爷记

import re


def get_id(url):
    if "question" in url and "answer" in url:
        print("You entered a full answer URL, extracting the id..")
        qid = re.search(r'question/(.+?)/answer', url).group(1)
    elif "question" in url:
        print("You entered a question URL, extracting the id..")
        qid = url.split('/')[-1]
    else:
        print("You entered a bare question id..")
        qid = url
    print(f'>> The Zhihu question id is: {qid}')
    return qid
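As a quick check, the three accepted input forms all resolve to the same id. A minimal self-contained sketch using the same regex and split logic:

```python
import re

def extract_id(url):
    # Full answer URL: the id sits between "question/" and "/answer"
    if "question" in url and "answer" in url:
        return re.search(r'question/(.+?)/answer', url).group(1)
    # Question URL: the id is the last path segment
    if "question" in url:
        return url.split('/')[-1]
    # Otherwise treat the input as a bare id
    return url

print(extract_id("https://www.zhihu.com/question/328457531"))                # 328457531
print(extract_id("https://www.zhihu.com/question/328457531/answer/1234567")) # 328457531
print(extract_id("328457531"))                                               # 328457531
```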

 

Since Zhihu serves nearly all of its data as JSON, you can simply request the API endpoint and parse the response; the only thing to watch out for is the pagination scheme and its parameters!

 

Three parameters need attention here:

  • the question id — the id from the Zhihu question link
  • limit — the number of answers per request, normally capped at 5; for the initial answer page (which I count as page 0 here) it is set to 3
  • offset — the page offset
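Putting those together, the query parameters per page look roughly like this (a sketch reconstructed from the snippets below — `limit` and `offset` are the real query parameters, but the page-0 special case is this post's own convention, not a documented Zhihu contract):

```python
def api_params(page):
    # Page 0 is the initial load: 3 answers, empty offset
    if page == 0:
        return {"limit": 3, "offset": ""}
    # Subsequent pages: 5 answers each, offset carries the page value
    return {"limit": 5, "offset": page}

print(api_params(0))  # {'limit': 3, 'offset': ''}
print(api_params(3))  # {'limit': 5, 'offset': 3}
```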

 

Reference code for fetching a single page:

    # Fetch a single page of answers
    def get_content(self,page):
        url=f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=5&offset={page}&platform=desktop&sort_by=default"
        response=requests.get(url,headers=self.headers,timeout=5)
        time.sleep(2)
        print(response.status_code)
        html=response.content.decode('utf-8')
        req=json.loads(html)
        json_datas=req['data']
        self.get_data(page,json_datas)
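Each entry in the `data` array carries the answer body as HTML in its `content` field (it is requested via the `include` parameter above); the text and image URLs can then be pulled out with a couple of regexes. A minimal sketch (`parse_answer` is a hypothetical helper, not part of the original code):

```python
import re

def parse_answer(answer):
    # 'content' is the answer body as an HTML fragment
    html = answer.get('content', '')
    imgs = re.findall(r'<img[^>]+src="([^"]+)"', html)  # collect image URLs
    text = re.sub(r'<[^>]+>', '', html)                 # strip tags for plain text
    return text, imgs

text, imgs = parse_answer({'content': '<p>hello</p><img src="https://pic.zhimg.com/a.jpg">'})
print(text)  # hello
print(imgs)  # ['https://pic.zhimg.com/a.jpg']
```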

Reference code for fetching page 0 and the total answer count:

    # Fetch page 0 and the total answer count
    def get_pagenum(self):
        page=0
        url = f"https://www.zhihu.com/api/v4/questions/{self.id}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bsettings.table_of_content.enabled%3B&limit=3&offset=&platform=desktop&sort_by=default"
        response = requests.get(url, headers=self.headers, timeout=5)
        time.sleep(2)
        print(f">> Fetching page {page} data..")
        print(response.status_code)
        html = response.content.decode('utf-8')
        req = json.loads(html)
        totals=req['paging']['totals']
        print(f'Total answers: {totals}')
        self.get_page(totals)
        json_datas = req['data']
        self.get_data(page,json_datas)

The interesting part is how Zhihu paginates the answer data. Here is my reference take — it may not be entirely accurate!

    # Compute the number of answer pages
    def get_page(self,totals):
        # Page 0 is fetched separately, so the rest spans (totals-4)/5 pages, rounded up
        pagenum=(int(totals)-4)/5
        if pagenum>int(pagenum):
            pagenum=int(pagenum)+1
        else:
            pagenum=int(pagenum)
        self.pagenum=pagenum
        print(f'>> {self.pagenum} answer pages in total')
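The round-up above is just a ceiling division, so assuming the same (totals - 4) / 5 heuristic it collapses to one line:

```python
import math

def page_count(totals):
    # Same heuristic as get_page above: (totals - 4) / 5, rounded up
    return math.ceil((int(totals) - 4) / 5)

print(page_count(100))  # 20  (96 / 5 = 19.2, rounded up)
print(page_count(104))  # 20  (exactly 100 / 5)
```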

Run output:

 

Scraping results:

 

Getting the complete Python source code

Follow this humble author's WeChat official account: 二爷记

Reply "知乎问答" in the account's chat

Getting the exe authorization code

Follow the WeChat official account: 二爷记

Reply "知乎爬取授权" in the account's chat

 

Repost with attribution: 二爷记 » Python Crawler: Zhihu Q&A Crawler Demo for Scraping Text and Images (No Cookies Required)
