Using Common Python Crawler Libraries
Using builtwith
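builtwith detects the web technologies behind a site. A minimal sketch (the target URL here is arbitrary; any site works):

# pip3 install builtwith
import builtwith

# Returns a dict mapping categories to detected technologies,
# e.g. {'web-servers': ['Nginx'], 'javascript-frameworks': ['jQuery']}
print(builtwith.parse('http://example.com'))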
Using the Requests Library
A first example

A minimal sketch, using httpbin.org as the test endpoint:

import requests

response = requests.get('http://httpbin.org/get')
print(response.status_code)  # HTTP status code
print(response.text)         # body decoded as text
print(response.cookies)      # cookies set by the server
The various request methods

Requests exposes one function per HTTP verb; a sketch against httpbin.org:

import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
Basic GET requests

Basic usage

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)
GET requests with parameters

Parameters can be written directly into the query string:

import requests

response = requests.get('http://httpbin.org/get?name=germey&age=22')
print(response.text)

Or passed as a dict via the params argument, which requests URL-encodes for you:

import requests

data = {'name': 'germey', 'age': 22}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
Parsing JSON

import requests

response = requests.get('http://httpbin.org/get')
print(response.json())  # equivalent to json.loads(response.text)
Fetching binary data

response.content holds the raw bytes; a sketch fetching GitHub's favicon:

import requests

response = requests.get('https://github.com/favicon.ico')
print(type(response.text), type(response.content))  # str vs. bytes

Writing the bytes to disk saves the file:

import requests

response = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
Adding headers

Without custom headers, the default User-Agent identifies python-requests:

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

Pass a headers dict to send your own:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('http://httpbin.org/get', headers=headers)
print(response.text)
Basic POST requests

Form data goes in the data argument:

import requests

data = {'name': 'germey', 'age': 22}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)

Headers can be combined with the form data:

import requests

data = {'name': 'germey', 'age': 22}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.json())
Response attributes

import requests

response = requests.get('http://httpbin.org/get')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
Checking the status code

Compare against the named constants in requests.codes:

import requests

response = requests.get('http://httpbin.org/get')
if response.status_code == requests.codes.ok:
    print('Request Successfully')

Or against the numeric code directly:

import requests

response = requests.get('http://httpbin.org/get')
if response.status_code == 200:
    print('Request Successfully')
The names come from the mapping inside requests.status_codes, which begins:

100: ('continue',),
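Each entry maps a numeric status code to the attribute names exposed on requests.codes, so checks can be written symbolically. A small usage sketch:

import requests

print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404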
Uploading files

Pass an open file object via the files argument (favicon.ico from the earlier example):

import requests

files = {'file': open('favicon.ico', 'rb')}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
Getting cookies

import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
Session maintenance

Simulating login

Two independent requests do not share cookies, so the cookie set by the first call is gone in the second:

import requests

requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get('http://httpbin.org/cookies')
print(response.text)  # {"cookies": {}}

A Session object carries cookies across requests:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)  # {"cookies": {"number": "123456789"}}
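For a real site the pattern is the same: POST the credentials once through a Session, then reuse that session for pages behind the login. A hypothetical sketch (the URL and form fields are placeholders, not a real API):

import requests

s = requests.Session()
# Hypothetical login endpoint and field names; substitute the real ones.
s.post('http://example.com/login', data={'username': 'user', 'password': 'pass'})
# The session cookie set at login rides along automatically.
profile = s.get('http://example.com/profile')
print(profile.status_code)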
Certificate verification

By default requests verifies HTTPS certificates and raises SSLError when verification fails (the classic demo target is 12306.cn, whose certificate historically failed verification):

import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)

Verification can be switched off, and the resulting warning silenced:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

Or a local client certificate can be supplied (the paths are placeholders):

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)
Proxy configuration

Pass a proxies dict; the addresses below are local placeholders:

import requests

proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743',
}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.status_code)

Proxies that require authentication take the credentials in the URL:

import requests

proxies = {'http': 'http://user:password@127.0.0.1:9743/'}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.status_code)

SOCKS proxies need an extra dependency:

pip3 install 'requests[socks]'

import requests

proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742',
}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.status_code)
Timeout configuration

The timeout argument bounds how long a request may take; exceeding it raises ReadTimeout:

import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
Authentication

HTTP basic auth via HTTPBasicAuth (the URL and credentials are placeholders):

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://example.com:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

A (user, password) tuple is accepted as shorthand:

import requests

r = requests.get('http://example.com:9001', auth=('user', '123'))
print(r.status_code)
Exception handling

Catch the specific exceptions first, then the RequestException base class:

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')
Connection error
Using the urllib Library
urllib.request.urlopen
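Its full signature in Python 3 is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)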
GET requests

A plain urlopen issues a GET; read() returns the body as bytes:

import urllib.request

# The target URL is an assumption; the sample output below is a small
# GitHub Pages stub that redirects to the author's blog.
response = urllib.request.urlopen('https://asahinokawa.github.io')
print(response.read().decode('utf-8'))
<html>
<script>
window.location.href="http://blog.csdn.net/asahinokawa"
</script>
</html>
POST requests

Encode the form data to bytes and pass it via the data argument; the output below shows the form echoed back by httpbin:

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))
{
"args": {},
"data": "",
"files": {},
"form": {
"word": "hello"
},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Content-Length": "10",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Python-urllib/3.6"
},
"json": null,
"origin": "14.154.29.216",
"url": "http://httpbin.org/post"
}
Setting a timeout: if the request takes longer than this limit, an exception is raised.

import urllib.request

# timeout is in seconds; this request finishes well within it
response = urllib.request.urlopen('http://httpbin.org/get?data=haha', timeout=1)
print(response.read().decode('utf-8'))
{
"args": {
"data": "haha"
},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "Python-urllib/3.6"
},
"origin": "14.154.29.216",
"url": "http://httpbin.org/get?data=haha"
}
Catching the exception

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout surfaces as a URLError wrapping socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
Time Out
The response

Response type

import urllib.request

response = urllib.request.urlopen('https://asahinokawa.github.io')  # assumed target, as above
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
<class 'http.client.HTTPResponse'>
200
[('Server', 'GitHub.com'), ('Content-Type', 'text/html; charset=utf-8'), ('Last-Modified', 'Sat, 20 Jan 2018 04:10:07 GMT'), ('Access-Control-Allow-Origin', '*'), ('Expires', 'Mon, 22 Jan 2018 13:32:43 GMT'), ('Cache-Control', 'max-age=600'), ('X-GitHub-Request-Id', '71A8:11832:91183A:9A71F6:5A65E5A3'), ('Content-Length', '106'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 22 Jan 2018 14:26:04 GMT'), ('Via', '1.1 varnish'), ('Age', '480'), ('Connection', 'close'), ('X-Served-By', 'cache-hnd18748-HND'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '1'), ('X-Timer', 'S1516631164.467386,VS0,VE0'), ('Vary', 'Accept-Encoding'), ('X-Fastly-Request-ID', 'f3adc99bea78d082bc4447a4a04f427731a40dad')]
GitHub.com
Request

A Request object wraps the URL and lets you attach headers, form data, and the HTTP method before calling urlopen.

The simplest form just wraps a URL:

from urllib import request, parse

req = request.Request('https://asahinokawa.github.io')  # assumed target, as above
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Headers, data, and method can all be given in the constructor; the second output below shows the custom User-Agent arriving at the server:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org',
}
form = {'name': 'Germey'}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
<html>
<script>
window.location.href="http://blog.csdn.net/asahinokawa"
</script>
</html>
{
"args": {},
"data": "",
"files": {},
"form": {
"name": "Germey"
},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Content-Length": "11",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
},
"json": null,
"origin": "14.154.29.0",
"url": "http://httpbin.org/post"
}
Headers can also be attached after construction with add_header:

from urllib import request, parse

url = 'http://httpbin.org/post'
form = {'name': 'China'}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
{
"args": {},
"data": "",
"files": {},
"form": {
"name": "China"
},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Content-Length": "10",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
},
"json": null,
"origin": "14.154.28.26",
"url": "http://httpbin.org/post"
}
Handler

Setting a proxy. Rotating among proxies helps keep the server from banning our IP.

import urllib.request

# Proxy addresses are local placeholders
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read().decode('utf-8'))
Working with cookies

Cookies maintain login state, so they matter for sites that require signing in.

# Get cookies
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
BAIDUID=417382BEB774A45EA7FC2C374C846E04:FG=1
BIDUPSID=417382BEB774A45EA7FC2C374C846E04
H_PS_PSSID=25639_1446_21107_20930
PSTM=1516632429
BDSVRTM=0
BD_HOME=0
Saving cookies to a local file

MozillaCookieJar writes cookies in the Mozilla cookies.txt format:

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
An alternative cookie file format

LWPCookieJar uses the libwww-perl format instead:

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Loading local cookies into a request. Read cookies back with the same CookieJar class that saved them.

import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://httpbin.org/get')
print(response.read().decode('utf-8'))
{
"args": {},
"headers": {
"Accept-Encoding": "identity",
"Connection": "close",
"Host": "httpbin.org",
"User-Agent": "Python-urllib/3.6"
},
"origin": "14.154.29.0",
"url": "http://httpbin.org/get"
}
Handling exceptions when requesting a page

from urllib import request, error

try:
    # A page that does not exist; the URL is a placeholder
    response = request.urlopen('https://example.com/index.htm')
except error.URLError as e:
    print(e.reason)
Not Found
More detailed error types

HTTPError is a subclass of URLError and carries the status code and response headers:

from urllib import request, error

try:
    response = request.urlopen('https://example.com/index.htm')  # placeholder URL
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
Not Found
404
Server: GitHub.com
Content-Type: text/html; charset=utf-8
ETag: "5a62c123-215e"
Access-Control-Allow-Origin: *
X-GitHub-Request-Id: F758:1D9CC:916D4F:9B5FD5:5A65FE22
Content-Length: 8542
Accept-Ranges: bytes
Date: Mon, 22 Jan 2018 15:09:43 GMT
Via: 1.1 varnish
Age: 148
Connection: close
X-Served-By: cache-hnd18733-HND
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1516633784.646920,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: fb6b42a4706a17c3de8bd318b76be3513b68c49b
Checking the reason attribute

reason is not always a string; for a timeout it is a socket.timeout instance:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
<class 'socket.timeout'>
TIME OUT
Working with URLs

Parsing URLs

urlparse splits a URL into six components; the outputs below correspond to these calls in order:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

# A missing scheme falls back to the scheme argument
print(urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https'))

# ...but scheme= is ignored when the URL already carries one
print(urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https'))

# allow_fragments=False folds the fragment into the query (or the path)
print(urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False))
print(urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False))
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
Assembling URLs

urlunparse rebuilds a URL from its six components:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
http://www.baidu.com/index.html;user?a=6#comment
urlencode turns a dict into a query string:

from urllib.parse import urlencode

params = {'name': 'germey', 'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
http://www.baidu.com?name=germey&age=22