Problems encountered when writing a crawler in Python 3, and how to solve them

2022-02-07 04:15:29

cannot use a string pattern on a bytes-like object

The error occurs in the following code:

re.findall(pattern, data)

Here data is of type bytes while the pattern is a str, so the call raises this error.
We can check this by changing the code above to

print(type(data))
re.findall(pattern, data)

The printed result:

<class 'bytes'>

So the data has to be converted to str before calling re.findall(). The fix:

re.findall(pattern, data.decode('utf-8'))
The decode and encode methods convert in the following directions:

        decode                   encode
bytes ---------> str (Unicode) ---------> bytes
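
As a minimal sketch of the point above, the following shows a str pattern failing on bytes and then working once the data is decoded. The sample HTML and the pattern are made up for illustration and are not taken from the crawler.

```python
import re

# Hypothetical sample data standing in for the bytes returned by read()
data = '<p>héllo</p><p>world</p>'.encode('utf-8')
pattern = r'<p>(.*?)</p>'

try:
    re.findall(pattern, data)  # str pattern applied to bytes data
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Decode bytes -> str first; then the str pattern can be applied
text = data.decode('utf-8')
matches = re.findall(pattern, text)
print(matches)

# encode() goes the other way: str -> bytes
round_trip = text.encode('utf-8')
```

The pattern and the data only have to be of the same kind, so decoding the data is one option; which encoding to pass to decode() depends on the page being fetched.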


Reference links:

http://blog.csdn.net/moodytong/article/details/8136258
http://blog.csdn.net/riyao/article/details/3629910

The second reference link says that in Python 3 the argument type of findall changed to `chart-like`, i.e. str.
To clarify: I checked the official documentation, and even in Python 2 it is str. The argument type has not changed.


'utf-8' codec can't decode byte 0x8b in position 1

The cause is, first, that the request header is configured with


'Accept-Encoding':' gzip, deflate'  

Second, decode('utf-8') is called directly on the result of read(), as follows:


data = op.read().decode('utf-8')
# The data returned by op.read() is still compressed, so calling decode() on it raises the exception above.

Accept-Encoding tells the server that the client accepts compressed responses: the server compresses the content before sending it, and the client receives it in compressed form.
A browser decompresses the data locally after receiving it. The error occurs because your program did not decompress it.
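
To see where the byte 0x8b in the message comes from, here is a small sketch. It uses a made-up page body compressed locally with gzip.compress instead of a real server response; that substitution is an assumption for illustration.

```python
import gzip

# Hypothetical page body, compressed the way a server might with gzip
body = gzip.compress('<html>hello</html>'.encode('utf-8'))

# Every gzip stream starts with the magic bytes 0x1f 0x8b
print(body[:2])

try:
    body.decode('utf-8')  # decoding without decompressing first
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte
    print(exc)

# Decompress first, then decode
html = gzip.decompress(body).decode('utf-8')
```

The first byte, 0x1f, happens to be valid ASCII, so decoding only fails at the second byte, 0x8b, which is why the message reports position 1.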


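Note that the answers found online say to delete Accept-Encoding.


But since we know the cause is that the data was not decompressed, decompressing it should be enough, so why delete the header?
The correct approach is to decompress the data first. Here is my decompression code.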

import gzip

# Decompress gzip data


def ungzip(data):
    try:
        print('Decompressing ...')
        data = gzip.decompress(data)
        #data = gzip.decompress(data).decode('utf-8')
        print('Decompression complete')
    except OSError:
        # gzip.decompress raises OSError when the input is not gzip data
        print('Not compressed, no need to decompress')
    return data.decode('utf-8')

With this, the code that reads the response looks like this:


data = op.read()
# Don't write data = op.read().decode('utf-8') here: the data returned by
# op.read() is still compressed, so calling decode() on it raises the exception above.
data = ungzip(data)
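
Instead of trying to decompress and catching the failure, one could also decide based on the Content-Encoding value the server sends back. The following is only a sketch under that assumption; decode_body and its parameters are hypothetical names, not part of the original crawler.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Decompress raw according to a Content-Encoding value, then decode it."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        raw = gzip.decompress(raw)
    elif encoding == 'deflate':
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate
    # Anything else is treated as uncompressed
    return raw.decode(charset)


# Hypothetical bodies standing in for op.read()
page = '<html>hello</html>'
from_gzip = decode_body(gzip.compress(page.encode('utf-8')), 'gzip')
from_deflate = decode_body(zlib.compress(page.encode('utf-8')), 'deflate')
from_plain = decode_body(page.encode('utf-8'))
```

With the opener used in this article, the call would presumably look like decode_body(op.read(), op.headers.get('Content-Encoding')).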

a bytes-like object is required, not 'list'

First, here is the full code.


# -*- coding:utf-8 -*-
import re
import urllib
import urllib.request
import gzip
import http.cookiejar
import io
import sys
import string
# gb18030
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
# decompress


def ungzip(data):
    try:
        print('Decompressing ...')
        data = gzip.decompress(data)
        print('Decompression complete')
    except OSError:
        # gzip.decompress raises OSError when the input is not gzip data
        print('Not compressed, no need to decompress')
    return data.decode('utf-8')

# Get xsrf


def getXSRF(data):
    cer = re.compile('name="_xsrf" value="(.*)"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

# Wrap the request header


def getOpener(head):
    # deal with the Cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

# Save


def saveFile(data):
    data = data.encode('utf-8')  # str -> bytes
    save_path = r'E:\temp.out'  # raw string, so \t is not read as a tab
    f_obj = open(save_path, 'wb')  # 'wb' opens the file for writing in binary mode
    f_obj.write(data)
    f_obj.close()


# Request header value
header = {
    'Connection': 'Keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Accept-Encoding': 'gzip,deflate',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Host': 'www.qiushibaike.com'
}


# page = 1
url = 'http://www.qiushibaike.com/hot/'
# get request headers
# opener = getOpener(header)
# op = opener.open(url)
# data = op.read()
# data = ungzip(data) # decompress
# _xsrf = getXSRF(data.decode())


try:
    opener = getOpener(header)
    op = opener.open(url)
    data = op.read()

    data = ungzip(data)
    # op = urllib.request.urlopen(url)
    # NOTE: the HTML tags that belong between the groups are not shown;
    # only the wildcards and capture groups remain.
    strRex = ('.*?(.*?).*?(.*?)'
              + '.*?(.*?)(.*?)(.*?)')
    # The flag is cut off after "re." in the source; re.S is an assumption.
    pattern = re.compile(strRex, re.S)
    print(type(data))
    items = re.findall(pattern, data)
    for item in items:
        print(item[0] + item[1] + item[2] + item[3])
        # print(item)
    print(items)
    # saveFile(''.join(str(e) for e in items))  # correct code
    saveFile(items)
except Exception as e:
    print(e)
The error above is caused by this line:


saveFile(items)

In saveFile, `f_obj = open(save_path, 'wb')` uses the mode "wb", which opens the file for writing in binary mode, so write() expects bytes. But the items I passed in is a list, so it reports an error.
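
The message in the heading can be reproduced with a few lines. This sketch writes a made-up list of tuples to a temporary file opened in binary mode; the sample items are an assumption, shaped like what re.findall returns for a pattern with several groups.

```python
import tempfile

# Hypothetical matches; re.findall returns a list of tuples when the
# pattern has several capture groups
items = [('a', 'b'), ('c', 'd')]

# TemporaryFile opens in binary mode ('w+b') by default
with tempfile.TemporaryFile() as f_obj:
    try:
        f_obj.write(items)  # a list is not a bytes-like object
    except TypeError as exc:
        write_error = str(exc)
        print(write_error)
```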



After checking some references, I found that the cause is the b flag in the mode, so I removed it:

f_obj = open(save_path, 'w')

Running the code shows that write() now requires a str. That is, when the file is not opened in binary mode (b), write() expects str by default.
Solution: first convert the data to be written to str, then convert that str to bytes inside saveFile.
We still open the file in binary mode, and add the following to saveFile:


data = data.encode('utf-8')

The encode method converts a str to bytes.
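
The relationship between the file mode and the type write() accepts can be sketched as follows. The temporary file and the sample text are assumptions for illustration; the crawler itself writes to save_path.

```python
import os
import tempfile

# Temporary file used only for this illustration
fd, path = tempfile.mkstemp()
os.close(fd)

text = 'hello'

with open(path, 'wb') as f:  # binary mode expects bytes
    try:
        f.write(text)
    except TypeError as exc:
        binary_error = str(exc)
    f.write(text.encode('utf-8'))  # str -> bytes, then it works

with open(path, 'w', encoding='utf-8') as f:  # text mode expects str
    try:
        f.write(text.encode('utf-8'))
    except TypeError as exc:
        text_error = str(exc)
    f.write(text)

with open(path, encoding='utf-8') as f:
    saved = f.read()
os.remove(path)
```

Either combination works, binary mode with bytes or text mode with str, as long as the two sides agree.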
Lists and tuples can be converted to str with the `''.join()` method.
At first I wrote it as


saveFile(''.join(items))

which raised another error:

sequence item 0: expected str instance, tuple found

This means that join expected the first element of the sequence to be a str but found a tuple.
So we first have to iterate over the list and convert each tuple to a str.


''.join(str(e) for e in items)

Read from right to left, this code iterates over items, applies str() to each element to convert it to a str, and then joins the resulting strings with `''.join()`. The str() step is needed because each element of items is a tuple.
A tuple converted with str() prints the same as the tuple itself, but type() shows the difference:


    s = ('a', 'b', 'c')
    print(str(s))
    print(s)
    print(type(str(s)))
    print(type(s))

The printed result:


('a', 'b', 'c')
('a', 'b', 'c')
<class 'str'>
<class 'tuple'>

So the solution is to change the saveFile call to:


saveFile(''.join(str(e) for e in items))
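
A short sketch of the difference, using made-up items. The last variant is not from the original article; it is one possible alternative that joins the fields of each tuple instead of keeping the tuple's printed form.

```python
# Hypothetical matches, shaped like the tuples re.findall returns for a
# pattern with several capture groups
items = [('a', 'b'), ('c', 'd')]

try:
    ''.join(items)  # join expects str elements, not tuples
except TypeError as exc:
    join_error = str(exc)
    print(join_error)

# The approach above: str() each tuple, which keeps the parentheses,
# quotes and commas of the tuple's printed form
as_repr = ''.join(str(e) for e in items)

# An alternative: join the fields of each tuple, one match per line
as_lines = '\n'.join(' '.join(fields) for fields in items)
```

Which form is preferable depends on how the saved file is meant to be read later.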

Finally, here are the conversions between list, tuple and str.

The list() function converts a str or a tuple into a list.
The tuple() function converts a str or a list into a tuple.


>>> s = "xxxxxxx"
>>> list(s)
['x', 'x', 'x', 'x', 'x', 'x', 'x']
>>> tuple(s)
('x', 'x', 'x', 'x', 'x', 'x', 'x')
>>> tuple(list(s))
('x', 'x', 'x', 'x', 'x', 'x', 'x')
>>> list(tuple(s))
['x', 'x', 'x', 'x', 'x', 'x', 'x']

Converting a list or tuple back into the original string relies on the join method; str() gives the printed form of the container instead.


>>> "".join(tuple(s))
'xxxxxxx'
>>> "".join(list(s))
'xxxxxxx'
>>> str(tuple(s))
"('x', 'x', 'x', 'x', 'x', 'x', 'x')"

If you use the Sublime Text 3 plugin SublimeREPL, the outer double quotes are not displayed. The same applies to the outputs above.

Reference links:

http://blog.csdn.net/sruru/article/details/7803208
http://stackoverflow.com/questions/5618878/how-to-convert-list-to-string
http://piziyin.blog.51cto.com/2391349/568426
http://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str