
A Detailed Guide to the urllib Library

Contents:

I. What is the urllib library?

II. urllib usage guide

 

 

I. What is the urllib library?

It is Python's built-in HTTP request library, made up of four modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL-parsing module (splitting, joining, etc.)

urllib.robotparser: the robots.txt-parsing module
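
As a quick orientation, here is a minimal sketch that touches three of the four modules in a single request (httpbin.org is used as a test endpoint, just as in the examples below):

from urllib import request, parse, error

# parse builds the query string, request sends it, error handles failures
url = 'http://httpbin.org/get?' + parse.urlencode({'q': 'urllib'})
try:
    with request.urlopen(url, timeout=5) as resp:
        print(resp.status)  # 200 on success
except error.URLError as e:
    print('request failed:', e.reason)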

 

II. urllib usage guide

 

1. urlopen

Signature:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # the first three parameters are the URL, the request data, and the timeout

The first step of a crawler (a basic urlopen call):

from urllib import request

response = request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))  # read the body of the response

A POST request (encoding the body with parse):

from urllib import parse, request

data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')

response1 = request.urlopen('http://httpbin.org/post', data=data)  # http://httpbin.org/ is a site for testing HTTP requests

print(response1.read())

Timeout setting:

response2 = request.urlopen('http://httpbin.org/get', timeout=1)  # set the timeout to 1 second

print(response2.read())

import socket
from urllib import error

try:
    response3 = request.urlopen('http://httpbin.org/get', timeout=0.1)  # set the timeout to 0.1 seconds
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # use isinstance to check whether the cause of the error is a timeout
        print('TIME OUT')

2. Responses

Response type:

print(type(response))  # reusing the response from earlier; you could just as well create a new one
# <class 'http.client.HTTPResponse'>
Status code and response headers:

print(response.status)  # status code
print(response.getheaders())  # all response headers, as a list of (name, value) pairs
print(response.getheader('Set-Cookie'))  # since headers are name/value pairs, a single one can be looked up by name
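
A response can also be used as a context manager, which closes the connection automatically when the block exits; a small sketch:

from urllib import request

with request.urlopen('http://httpbin.org/get') as response:
    print(response.status)
    print(response.getheader('Content-Type'))  # e.g. application/json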

 

3. Request

from urllib import request

request1 = request.Request('http://python.org/')  # build a Request object; for a plain GET like this you could skip it and pass the URL to urlopen directly
response = request.urlopen(request1)
print(response.read().decode('utf-8'))
from urllib import parse, request

url = 'http://httpbin.org/post'  # construct a POST request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Host': 'httpbin.org'
}

dict1 = {
    'name': 'Germey'
}

data = bytes(parse.urlencode(dict1), encoding='utf8')  # the form data

req = request.Request(url=url, data=data, headers=headers, method='POST')  # assemble everything into a single Request object

response = request.urlopen(req)
print(response.read().decode('utf-8'))  # the output echoes the headers and dict1 we constructed above

Another way to construct the same POST request:

req1 = request.Request(url=url, data=data, method='POST')
req1.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')  # attach the header with add_header

response = request.urlopen(req1)

print(response.read().decode('utf-8'))

 

4. Handler

Proxies (official docs: https://docs.python.org/3/library/urllib.request.html#module-urllib.request)

from urllib import request

proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})  # this proxy address has expired (my source was recently blocked), so the output cannot be shown here, sorry

opener = request.build_opener(proxy_handler)

response = opener.open('http://www.baidu.com')
print(response.read())
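
Proxies are only one kind of handler. As another sketch, HTTP basic authentication follows the same build_opener pattern (the endpoint and the user/passwd credentials below are illustrative; httpbin's basic-auth endpoint accepts whatever pair appears in its path):

from urllib import request

password_mgr = request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://httpbin.org/basic-auth/user/passwd', 'user', 'passwd')  # realm=None means "any realm"
auth_handler = request.HTTPBasicAuthHandler(password_mgr)

opener = request.build_opener(auth_handler)
response = opener.open('http://httpbin.org/basic-auth/user/passwd')
print(response.status)  # 200 once authentication succeeds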

 

5. Cookies: text data stored on the client side that records the user's identity and keeps the login state alive

from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()  # create a CookieJar to collect the cookies

handler = request.HTTPCookieProcessor(cookie)

opener = request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:
    print(item.name + '=' + item.value)
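
Building on the example above, cookies can also be saved to disk and loaded back in a later session; a minimal sketch with MozillaCookieJar (the filename cookies.txt is arbitrary):

from urllib import request
from http import cookiejar

cookie = cookiejar.MozillaCookieJar('cookies.txt')  # Mozilla/Netscape file format
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # write the cookies to cookies.txt

# later, load them back before building the opener
cookie2 = cookiejar.MozillaCookieJar()
cookie2.load('cookies.txt', ignore_discard=True, ignore_expires=True)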

 

6. Exception handling

from urllib import request, error

# try to access a page that does not exist
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')  # http://www.cuiqingcai.com/ is Cui Qingcai's personal blog
except error.URLError as e:
    print(e.reason)  # check whether the printed reason matches the exception we expected to catch

Exceptions that can be caught (official docs: https://docs.python.org/3/library/urllib.error.html#module-urllib.error):

try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.HTTPError as e:  # it is best to catch HTTPError first, then the more general errors
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

import socket

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)  # force a timeout
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):  # check the type of the error
        print('TIME OUT')

 

7. URL parsing (official docs: https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse):

urlparse (splits a URL into its parts and assigns each to a named component)

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)  # (URL, default scheme, whether to split off the part after '#')
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
print(type(result), result)  # <class 'urllib.parse.ParseResult'>

# no scheme in the URL, so the scheme argument supplies one
result = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result)

# the URL already specifies a scheme, so the scheme argument is ignored
result1 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')

print(result1)

# the allow_fragments parameter
result1 = urlparse('http://www.baidu.com/s?#comment', allow_fragments=False)

result2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8#comment', allow_fragments=False)
print(result1, result2)  # with allow_fragments=False the part after '#' is not split off; it is folded back into the nearest preceding component (compare result1 and result2)
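
Since the ParseResult is a named tuple, each component can also be read by attribute; a quick sketch:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
print(result.scheme)  # https
print(result.netloc)  # www.baidu.com
print(result.path)    # /s
print(result.query)   # wd=urllib&ie=UTF-8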

urlunparse (the inverse of urlparse)

For example:

from urllib.parse import urlunparse

# data takes the six parts in the order urlparse produces them; note that empty components must still be passed as empty strings, otherwise it errors
data = ['https', 'www.baidu.com', '/s', '', 'wd=urllib&ie=UTF-8', '']

print(urlunparse(data))  # https://www.baidu.com/s?wd=urllib&ie=UTF-8
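
As an aside, urlsplit/urlunsplit behave the same way but use five parts, with params folded into path; a quick sketch:

from urllib.parse import urlunsplit

data = ['https', 'www.baidu.com', '/s', 'wd=urllib&ie=UTF-8', '']
print(urlunsplit(data))  # https://www.baidu.com/s?wd=urllib&ie=UTF-8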

urljoin (joins URLs):

from urllib.parse import urljoin

# in short: the second argument is resolved against the first, whether it is a real link or made up; if the second argument is itself a complete http/https URL, no joining happens and the second URL wins
print(urljoin('http://www.baidu.com', 'FQA.html'))
# http://www.baidu.com/FQA.html

print(urljoin('http://www.baidu.com', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html

print(urljoin('https://www.baidu.com/about.html', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html

print(urljoin('http://www.baidu.com/about.html', 'https://www.caiqingcai.com/FQA.html'))
# https://www.caiqingcai.com/FQA.html

urlencode (converts a dict into GET request parameters):

from urllib.parse import urlencode

params = {
    'name': 'Arise',
    'age': '21'
}

base_url = 'http://www.baidu.com?'

url = base_url + urlencode(params)

print(url)
# http://www.baidu.com?name=Arise&age=21
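
Two related helpers worth knowing here: quote percent-encodes a single URL component (useful for Chinese keywords), and unquote decodes it; a quick sketch:

from urllib.parse import quote, unquote

keyword = '爬虫'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)           # https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
print(unquote(url))  # https://www.baidu.com/s?wd=爬虫
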
robotparser (parses robots.txt):

Official docs: https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser (included only for completeness)

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rrate = rp.request_rate("*")
rrate.requests
# 3
rrate.seconds
# 20
rp.crawl_delay("*")
# 6
rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
# False
rp.can_fetch("*", "http://www.musi-cal.com/")
# True

     
