转52PJ的爬虫爬取美女图片

网络资讯 2023-01-19 22:28:50 15

导读

)"')foriteminh4List:child_href='https://www.mn52.com'+''.join#将bs4.element.Tag对象转为string对象#拿到子页面的源代码child_page_resp=requests.get#文件夹命名p_n……

前言学完正则后就学习了爬虫篇章，在学前已定下了三个待爬网站，昨日已经完工，自学爬虫详细过程及实现已写以成文，但大部分文章CSDN上不让发，写文章一是为了记录方便来时翻阅，一是分享供他人借鉴，爬取了两美女网站，还不错，第一个文章目的达成了，第二个就靠分享来实现咯简介第一个网站爬取了一页，压缩下来有69M左右，之所以不多爬，是因为身体吃不消，主要是用不着，需要的话改下请求参数即可，第二个网站禁止按键右键，但无伤大雅，爬取了清纯美女前10页，还有清新美女专栏，爬取了一页，稍微改下url即可

# 1.来到美女图片网页，然后查看网页源代码确认是服务器渲染，然后获取源代码，批量提取高清大图页面链接地址

# 2.通过对两个高清大图链接地址定位比较，找到了共有的定位条件，然后获取链接下载

import requests

import re

from bs4 import BeautifulSoup

import time

import os

url = 'https://www.mn52.com/meihuoxiezhen/'

headers = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'

}

resp = requests.get(url,headers=headers)

# 把源码交给bs

main_page = BeautifulSoup(resp.text,'html.parser')

h4List = main_page.find_all("h4")

for i in range(-1,1):

h4List.pop(i)

# print(h4List) # 测试用的

p = re.compile(r'"(?P.*?)"')

for item in h4List:

child_href = 'https://www.mn52.com' + ''.join(p.findall(str(item))) # 将bs4.element.Tag对象转为string对象

# 拿到子页面的源代码

child_page_resp = requests.get(child_href,headers=headers)

# 文件夹命名

p_name = re.compile(r'

.*?

(?P.*?)

',re.S)

img_name = p_name.findall(child_page_resp.text)

img_name_f = ''.join(img_name)

dirs = 'D:/python_img/' + img_name_f

# 如果不存在这个文件夹就创建这个文件夹

if not os.path.exists(dirs):

os.makedirs(dirs)

# print(img_name) # 测试用的

# print(child_page_resp.text) # 测试用的

# break

# 把源代码交给bs

child_page = BeautifulSoup(child_page_resp.text,'html.parser')

imgList = child_page.find('div',id='piclist').find_all('img')

# print(imgList) # 测试用的

num = 1

for item in imgList:

src = 'https://www.mn52.com' + item.get('src')

# 下载图片

img_resp = requests.get(src)

# img_resp.content # 这里拿到的是字节

with open(f'{dirs}/{num}.jpg','wb') as f:

f.write(img_resp.content)

num += 1

print('over!!!')

time.sleep(1) # 缓一秒爬取

print('Game Over!!!')

resp.close()

第二个网页源代码 # 1.拿到主页面的源代码，然后提取到子页面的链接地址(这里可以先在网页打开子页面获取链接，然后反向定位在网页源码具体处)

# 2.正常情况下请求子页面源代码，然后获取下载链接下载，但这个有些不一样，该网站已经分好类了，所以不从首页爬起反而更高效

import os.path

import requests

from bs4 import BeautifulSoup

import time

headers = {

'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'

}

num = 1

# 爬取清纯少女前10页图片

while num <11:

url = f'https://www.vmgirls.com/pure/page/{num}'

resp = requests.get(url,headers=headers)

# print(resp.text) #测试用的

# 把源码交给bs

main_page = BeautifulSoup(resp.text,'html.parser')

aList = main_page.find_all('a',class_='list-title h-2x')

# print(len(aList)) # 测试用的

for a in aList:

a_href = a.get('href')

# print(a_href) # 测试用的

child_page_resp = requests.get(a_href,headers=headers)

# print(child_page_resp.text) # 测试用的

# 把源码交给bs

child_page = BeautifulSoup(child_page_resp.text,'html.parser')

img_name_f = child_page.find('title').string # 获取title标签下的文本

# print(img_name_f) # 测试用的

#文件夹命名

dirs = 'D:/Python_image/'+img_name_f

# 如果不存在该路径，那就创建一个

if not os.path.exists(dirs):

os.makedirs(dirs)

imgList = child_page.find('div',class_='nc-light-gallery').find_all('a')

imgList.pop(0)

count = 1

for item in imgList:

src = item.get('href')

# print(src) # 测试用的

# 下载图片

img_resp = requests.get(src)

with open(f'{dirs}/{count}.jpg','wb') as f:

f.write(img_resp.content)

count += 1

print('over!!!')

time.sleep(1) # 缓一秒爬取

# break # 用于测试

num += 1

break # 用于测试

print('======Game Over!!!======')

resp.close()

成品分享https://wwz.lanzouw.com/b031oobti 提取码：3jvl补充两点

请做一个善意爬虫，毕竟取之于民

任何蓝奏链接打不开都可以以下方式打开，以上面这个蓝奏链接为例，可以改为https://www.lanzou.com/b031oobti 提取码不变

原贴：https://www.52pojie.cn/thread-1711299-1-1.html

死神千年血战篇山本老头好像G了

猜你喜欢

爬（看小姐姐）全站图片

B站付费课之办公三件套——PPT、word、excel

《明日战记》1080P高清中字国粤双语版

tidal账号

某盘下载网站：克隆窝

1022首中文精选高格式-24bit-96khz-w(126GB)

评论(0)

转52PJ的爬虫爬取美女图片

(?P.*?)

黑亚当韩版高清中文字幕（字幕组版本）

死神千年血战篇山本老头好像G了

爬（看小姐姐）全站图片

B站付费课之办公三件套——PPT、word、excel

《明日战记》1080P高清中字国粤双语版

tidal账号

某盘下载网站：克隆窝

1022首中文精选高格式-24bit-96khz-w(126GB)

猫爪呸罗173v22.8G视频合集：颠覆传统的视频拍摄方式，探索新的可能！

转52PJ的爬虫爬取美女图片

(?P.*?)

黑亚当 韩版高清 中文字幕（字幕组版本）

死神千年血战篇 山本老头好像G了

爬（看小姐姐）全站图片

B站付费课之办公三件套——PPT、word、excel

《明日战记》1080P高清中字国粤双语版

tidal账号

某盘下载网站：克隆窝

1022首中文精选高格式-24bit-96khz-w(126GB)

猫爪呸罗173v22.8G视频合集：颠覆传统的视频拍摄方式，探索新的可能！

黑亚当韩版高清中文字幕（字幕组版本）

死神千年血战篇山本老头好像G了