A Quick Summary of Python Web Scraping

0x00 Preface

I recently played around with web scraping in Python. It turned out to be quite beginner-friendly and makes a good entry point into the language.

0x01 The Basic Pattern

  • Send requests with the requests library
  • Extract data with a parsing library such as BeautifulSoup, XPath (lxml), PyQuery, or regular expressions
  • Store the data in a file or a database

The following simple scraper for the Maoyan Top 100 board illustrates this pattern:

  • get_onepage(url) sends the request, mainly using the requests library
  • parse_onepage(content) parses the response and extracts the data, mainly using BeautifulSoup
  • save_csv(content) saves the data to a CSV file
import requests
from bs4 import BeautifulSoup
import csv
import os
import time


def get_onepage(url):
    # Use a browser-like User-Agent so the site does not reject the request
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/60.0.3112.113 Safari/537.36'}
    raw_content = requests.get(url, headers=headers)
    raw_content.encoding = 'utf-8'
    # print(raw_content.text)
    return raw_content.text


def parse_onepage(content):
    # Each movie on the board lives in its own <dd> element
    soup = BeautifulSoup(content, 'lxml')
    movie_items = soup.find_all('dd')
    for item in movie_items:
        movie = {'index': '', 'title': '', 'star': ''}
        movie_index = item.find('i', class_="board-index")
        movie['index'] = movie_index.get_text()
        movie_title = item.find('p', attrs={'class': 'name'})
        movie['title'] = movie_title.find('a').get_text()
        movie_actor = item.find('p', attrs={'class': 'star'})
        movie['star'] = movie_actor.get_text().strip()
        yield movie


def save_csv(content):
    path = "/Users/Rick7/Desktop/data.csv"
    # Write the header only once; otherwise appending each page would
    # repeat the header row in the middle of the file
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as csvfile:
        fieldnames = ['index', 'title', 'star']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        for item in parse_onepage(content):
            writer.writerow(item)


def main(offset_number):
    url = "http://maoyan.com/board/4?offset=" + str(offset_number)
    html = get_onepage(url)
    save_csv(html)


if __name__ == '__main__':
    for i in range(10):  # 100 entries, 10 per page
        main(i * 10)
        time.sleep(1)  # pause between requests to be polite to the server

0x02 Summary

  • This is about the simplest possible scraper; the point is to internalize the request → parse → store workflow behind it
  • The program above only works on static pages. For content loaded dynamically via Ajax I have tried selenium, but it felt slow and clumsy; capturing the network traffic to find the real URL is usually more practical
  • Compared with BeautifulSoup, I find XPath more efficient and concise for extracting data
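To illustrate the XPath point, here is a minimal sketch using lxml. The HTML snippet is a made-up stand-in that mimics the structure of a Maoyan board entry, not the live page; with XPath, each field comes out of a single path expression instead of chained find() calls:

```python
from lxml import etree

# Hypothetical HTML mimicking one <dd> entry on the Maoyan board
# (a made-up stand-in, not the real page)
html = """
<dl class="board-wrapper">
  <dd>
    <i class="board-index">1</i>
    <p class="name"><a href="/films/1203">Farewell My Concubine</a></p>
    <p class="star">  Starring: Leslie Cheung, Zhang Fengyi, Gong Li  </p>
  </dd>
</dl>
"""

tree = etree.HTML(html)
movies = []
for dd in tree.xpath('//dd'):
    movies.append({
        # one expression per field, no nested find() calls needed
        'index': dd.xpath('.//i[@class="board-index"]/text()')[0],
        'title': dd.xpath('.//p[@class="name"]/a/text()')[0],
        'star': dd.xpath('.//p[@class="star"]/text()')[0].strip(),
    })

print(movies)
```

The same three fields fall out of three one-line expressions, which is why XPath tends to feel terser once the selectors are worked out.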