Scraping Web Page Images with PHP and Python

优采云, published 2023-06-07 18:36

This article explains how to scrape the images on a web page with PHP and Python. It is aimed at readers who are new to writing crawlers.

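Step 1: Analyze the page structure

Before writing any code, we need to understand the structure of the page we want to scrape. Taking 博客园 (cnblogs) as an example, open the browser's developer tools and view the page source. Find the image tags (<img>) in the source and note their attributes, since the code in the following steps depends on them.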
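The same inspection can also be done in code. The following is a minimal sketch that uses Python's standard html.parser module to list the attributes of every <img> tag; the class name ImgAttrCollector, the helper list_img_attributes, and the sample markup are made up for illustration and are not part of the original article.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            self.images.append(dict(attrs))


def list_img_attributes(html):
    """Return one attribute dict per <img> tag found in html."""
    parser = ImgAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical sample markup, not taken from any real page.
sample = '<div><img src="/a.png" alt="A"><img data-src="/b.jpg" class="lazy"></div>'
for attrs in list_img_attributes(sample):
    print(attrs)
```

Output like this makes it easy to see whether a page stores the image address in src or in some other attribute (for example data-src on lazily loaded images) before a matching pattern is written.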

Step 2: Extract image links with PHP

In PHP, file_get_contents() fetches the page source and preg_match_all() matches the image links. The code is as follows:

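$url = "https://www.cnblogs.com/";
$html = file_get_contents($url);
// Capture the src attribute of every <img> tag ("i": case-insensitive, "s": "." matches newlines).
preg_match_all('/<img.*?src="(.*?)".*?>/is', $html, $matches);
// $matches[1] holds the captured src values.
foreach ($matches[1] as $value) {
    echo $value . "<br>";
}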

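Step 3: Extract image links with Python

The Python version is similar: the requests library fetches the page source and the re module matches the image links. The code is as follows: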

import re
import requests

url = "https://www.cnblogs.com/"
html = requests.get(url).text
# Capture the src attribute of every <img> tag; re.S lets "." match newlines.
pattern = re.compile('<img.*?src="(.*?)".*?>', re.S)
result = pattern.findall(html)
for item in result:
    print(item)
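The pattern used above is deliberately simple. On a tag such as <img data-src="lazy.png" src="real.png"> it captures lazy.png, because the first src=" it meets is the tail of data-src, and it only recognizes double-quoted values. The sketch below shows one possible stricter pattern, tested on a made-up HTML string; the pattern, the function name extract_img_srcs, and the sample are assumptions for illustration, not the article's own code.

```python
import re

# Assumed stricter pattern: stay inside one tag ([^>]), require whitespace
# before "src" so that "data-src" is skipped, and accept either quote style.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc\s*=\s*["\']([^"\'>]+)["\']', re.I)


def extract_img_srcs(html):
    """Return the src value of every <img> tag that has one."""
    return IMG_SRC.findall(html)


# Hypothetical sample markup, not taken from any real page.
sample = (
    '<img data-src="lazy.png" src="real.png">'
    "<img alt='x' src='single.jpg'>"
    '<img alt="no source"><script src="app.js"></script>'
)
print(extract_img_srcs(sample))
```

Regular expressions remain a shortcut here; for irregular markup an HTML parser is usually the more dependable choice.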

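Step 4: Handle relative image links

In practice, image links are often relative paths. These must be converted to absolute URLs before the images can be downloaded. The code below takes a simple approach: any link that does not start with "http" is prefixed with the page URL.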

$url = "https://www.cnblogs.com/";
$html = file_get_contents($url);
preg_match_all('/<img.*?src="(.*?)".*?>/is', $html, $matches);
foreach ($matches[1] as $value) {
    // A link that does not start with "http" is treated as relative to $url.
    if (strpos($value, "http") !== 0) {
        $value = $url . $value;
    }
    echo $value . "<br>";
}
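Plain string concatenation covers the simplest case only. As a hedged alternative on the Python side, the standard library's urllib.parse.urljoin resolves root-relative paths, protocol-relative links (starting with //), and ../ segments against the page URL. The helper name to_absolute and the example paths below are invented for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, src):
    """Resolve an image src against the URL of the page it was found on."""
    return urljoin(page_url, src)


base = "https://www.cnblogs.com/"

# Hypothetical src values covering the common cases.
print(to_absolute(base, "images/logo.png"))            # relative path
print(to_absolute(base, "/images/logo.png"))           # root-relative path
print(to_absolute(base, "//cdn.example.com/a.png"))    # protocol-relative link
print(to_absolute(base, "https://example.com/b.png"))  # already absolute
```

Links that are already absolute pass through unchanged, so the helper can be applied to every extracted link without a separate "starts with http" check.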

Step 5: Download the images

Once we have the image links, we need to save the images locally. In PHP, file_put_contents() can write each downloaded image to disk. The code is as follows:

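$url = "https://www.cnblogs.com/";
$html = file_get_contents($url);
preg_match_all('/<img.*?src="(.*?)".*?>/is', $html, $matches);
// file_put_contents() does not create directories, so make sure "images" exists.
if (!is_dir("images")) {
    mkdir("images", 0777, true);
}
foreach ($matches[1] as $value) {
    if (strpos($value, "http") !== 0) {
        $value = $url . $value;
    }
    // Save each image under the last segment of its URL.
    file_put_contents("images/" . basename($value), file_get_contents($value));
}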

In Python, the requests library can download the images as well. The code is as follows:

import os
import re
import requests

url = "https://www.cnblogs.com/"
html = requests.get(url).text
pattern = re.compile('<img.*?src="(.*?)".*?>', re.S)
result = pattern.findall(html)
# Create the output directory if it does not exist yet.
if not os.path.exists("images"):
    os.makedirs("images")
for item in result:
    if not item.startswith("http"):
        item = url + item
    # Save each image under the last segment of its URL.
    with open("images/" + os.path.basename(item), "wb") as f:
        f.write(requests.get(item).content)
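Taking the file name straight from the URL has two practical pitfalls: a query string such as ?v=3 ends up in the name, and two images with the same name in different directories overwrite each other. The following sketch shows one way to derive safer local names; the function local_filename, its default argument, and the example URLs are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_filename(img_url, used, default="image"):
    """Return a file name for img_url that is not yet in the set `used`."""
    path = urlsplit(img_url).path  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path)) or default
    stem, ext = posixpath.splitext(name)
    candidate, n = name, 1
    while candidate in used:  # avoid overwriting a file with the same name
        candidate = f"{stem}_{n}{ext}"
        n += 1
    used.add(candidate)
    return candidate


# Hypothetical URLs used only to show the behaviour.
seen = set()
print(local_filename("https://example.com/a/logo.png?v=3", seen))
print(local_filename("https://example.com/b/logo.png", seen))
print(local_filename("https://example.com/", seen))
```

The returned name can then be joined with the images directory when opening the output file.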

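Step 6: Speed up downloads with concurrency

Downloading a large number of images one at a time can be slow, so it helps to run several downloads at once. In PHP, curl_multi_init() and curl_multi_exec() run multiple transfers concurrently. Strictly speaking these transfers share a single PHP thread rather than running in separate threads, but the effect on download time is similar. The code is as follows: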

$url = "https://www.cnblogs.com/";
$html = file_get_contents($url);
preg_match_all('/<img.*?src="(.*?)".*?>/is', $html, $matches);
if (!is_dir("images")) {
    mkdir("images", 0777, true);
}
$mh = curl_multi_init();
$ch = [];
$links = [];
foreach ($matches[1] as $key => $value) {
    if (strpos($value, "http") !== 0) {
        $value = $url . $value;
    }
    $links[$key] = $value;
    $ch[$key] = curl_init($value);
    // Return the response body instead of printing it.
    curl_setopt($ch[$key], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch[$key]);
}
// Drive all transfers; curl_multi_select() waits for activity instead of busy-looping.
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);
    }
} while ($running > 0);
// Write each finished response to disk and release its handle.
foreach ($links as $key => $link) {
    file_put_contents("images/" . basename($link), curl_multi_getcontent($ch[$key]));
    curl_multi_remove_handle($mh, $ch[$key]);
}
curl_multi_close($mh);

In Python, a thread pool provides multithreaded downloads. The code is as follows:

import os
import re
from concurrent.futures import ThreadPoolExecutor

import requests

url = "https://www.cnblogs.com/"
html = requests.get(url).text
pattern = re.compile('<img.*?src="(.*?)".*?>', re.S)
result = pattern.findall(html)
if not os.path.exists("images"):
    os.makedirs("images")

def download_img(item):
    if not item.startswith("http"):
        item = url + item
    with open("images/" + os.path.basename(item), "wb") as f:
        f.write(requests.get(item).content)

# Run up to 10 downloads at a time; leaving the "with" block waits for all of them.
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(download_img, result)
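One thing to be aware of with executor.map: an exception raised inside a worker only surfaces when the returned results are iterated, so when the results are discarded, failed downloads go unnoticed. The sketch below is one possible variant that uses submit and as_completed to record which URLs failed. The names download_all and fake_fetch are hypothetical, and fake_fetch merely stands in for a real download function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=10):
    """Run fetch(url) for every URL in a thread pool.

    Returns (ok, failed): the URLs that succeeded, and a dict mapping each
    failed URL to the exception it raised.
    """
    ok, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            u = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                ok.append(u)
            except Exception as exc:
                failed[u] = exc
    return ok, failed


def fake_fetch(u):
    # Stand-in for a real download function, used only for this demonstration.
    if "broken" in u:
        raise IOError("simulated failure")


ok, failed = download_all(["a.png", "broken.png", "c.png"], fake_fetch)
print(sorted(ok), list(failed))
```

In real use, a function like download_img would be passed in place of fake_fetch, and the failed dict could drive a retry pass.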

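Step 7: Use a proxy IP

Some websites ban the IP addresses of crawlers. Routing requests through a proxy is one way to avoid being banned. In PHP, the proxy is set with curl_setopt(). The code is as follows: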

$url = "https://www.cnblogs.com/";
$html = file_get_contents($url);
preg_match_all('/<img.*?src="(.*?)".*?>/is', $html, $matches);
if (!is_dir("images")) {
    mkdir("images", 0777, true);
}
$mh = curl_multi_init();
$ch = [];
$links = [];
foreach ($matches[1] as $key => $value) {
    if (strpos($value, "http") !== 0) {
        $value = $url . $value;
    }
    $links[$key] = $value;
    $ch[$key] = curl_init($value);
    curl_setopt($ch[$key], CURLOPT_RETURNTRANSFER, true);
    // Route the transfer through the proxy; replace <port> with the port your proxy listens on.
    curl_setopt($ch[$key], CURLOPT_PROXY, "http://127.0.0.1:<port>");
    curl_multi_add_handle($mh, $ch[$key]);
}
// Drive all transfers; curl_multi_select() waits for activity instead of busy-looping.
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);
    }
} while ($running > 0);
// Write each finished response to disk and release its handle.
foreach ($links as $key => $link) {
    file_put_contents("images/" . basename($link), curl_multi_getcontent($ch[$key]));
    curl_multi_remove_handle($mh, $ch[$key]);
}
curl_multi_close($mh);

In Python, the requests library accepts a proxies argument. The code is as follows:

import os
import re
import requests

url = "https://www.cnblogs.com/"
html = requests.get(url).text
pattern = re.compile('<img.*?src="(.*?)".*?>', re.S)
result = pattern.findall(html)
if not os.path.exists("images"):
    os.makedirs("images")
# Proxy used for both http and https requests;
# replace <port> with the port your proxy listens on.
proxies = {
    "http": "http://127.0.0.1:<port>",
    "https": "http://127.0.0.1:<port>",
}
for item in result:
    if not item.startswith("http"):
        item = url + item
    with open("images/" + os.path.basename(item), "wb") as f:
        f.write(requests.get(item, proxies=proxies).content)
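For scripts that cannot depend on requests, the standard library offers a comparable mechanism. The sketch below builds the same kind of proxy mapping and attaches it to a urllib opener through urllib.request.ProxyHandler. The helper names build_proxies and make_proxy_opener are invented here, and port 8080 is an assumed example value, not one taken from the article.

```python
import urllib.request


def build_proxies(host, port):
    """Return a scheme-to-proxy mapping that sends http and https via one proxy."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def make_proxy_opener(proxies):
    """Build a urllib opener whose requests go through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


# Assumed example address: replace 8080 with the port your proxy listens on.
proxies = build_proxies("127.0.0.1", 8080)
opener = make_proxy_opener(proxies)
print(proxies)
# opener.open(image_url, timeout=10).read() would then fetch through the proxy.
```

No request is sent until opener.open() is called, so the snippet runs even when no proxy is listening.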

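Step 8: Summary

This article showed how to scrape the images on a web page with PHP and Python, covering the conversion of relative paths to absolute URLs, downloading images, speeding up downloads with concurrency, and using a proxy IP. We hope it is useful to readers who are new to crawlers.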
