Problems encountered when writing a crawler in Python 3, and how to solve them

2022-02-07 04:15:29

cannot use a string pattern on a bytes-like object

The error is raised by the following code.

re.findall(pattern, data)

Here data is of type bytes while pattern is a str, so the call raises this error: a str pattern can only be matched against a str.
To confirm this, print the type of data before the call:

print(type(data))
re.findall(pattern, data)

The printed result:

<class 'bytes'>

So data has to be converted to str before it is passed to re.findall():

re.findall(pattern, data.decode('utf-8'))
The decode() and encode() methods convert between the two types as follows:

          decode                    encode
bytes  ---------->  str (unicode)  ---------->  bytes
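
As a minimal sketch of the conversion above, the following reproduces the failing call and the fix. The sample bytes, the pattern, and the helper name find_links are made up for illustration; they are not part of the crawler.

```python
import re

# Hypothetical sample standing in for the bytes returned by read()
raw = '<a href="/hot/">café</a>'.encode('utf-8')
pattern = r'href="(.*?)"'


def find_links(data):
    """Return all href values; data may be bytes or str."""
    if isinstance(data, bytes):
        data = data.decode('utf-8')  # bytes -> str
    return re.findall(pattern, data)


try:
    re.findall(pattern, raw)  # str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

links = find_links(raw)
print(links)

# Round trip: bytes -> str -> bytes gives back the original bytes
text = raw.decode('utf-8')
round_trip = text.encode('utf-8')
print(round_trip == raw)
```

Another option, not used in this article, is to keep data as bytes and use a bytes pattern such as rb'href="(.*?)"'; the matches are then bytes as well.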

Reference links:

http://blog.csdn.net/moodytong/article/details/8136258
http://blog.csdn.net/riyao/article/details/3629910

The second reference link says that in Python 3 the argument of findall is now `char-like`, that is, str.
To clarify: I checked the official documentation, and it is str in Python 2 as well. The argument type has not changed.

'utf-8' codec can't decode byte 0x8b in position 1

This error has two causes. First, the request header is configured with

'Accept-Encoding': 'gzip, deflate'

Second, decode('utf-8') is called directly on the result of read(), as follows.

data = op.read().decode('utf-8')
# op.read() returns data that has not been decompressed yet, so calling decode() on it raises the exception above.

The Accept-Encoding header tells the server that the client accepts compressed data, so the server compresses the response before sending it back to the client.
A browser decompresses the data locally after receiving it. The error occurs because the program did not decompress it. The byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), which shows that the data is still gzip-compressed.
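
The cause can be reproduced without any network access. This is a small sketch with made-up sample text: it compresses a string with gzip, shows the two magic bytes, and triggers the same decode error.

```python
import gzip

body = 'hello crawler'.encode('utf-8')  # hypothetical page content
compressed = gzip.compress(body)        # roughly what a server sends for gzip

# Every gzip stream starts with the magic bytes 0x1f 0x8b
magic = compressed[:2]
print(magic)

try:
    compressed.decode('utf-8')  # decoding without decompressing first
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)

# Decompressing first makes the decode succeed
text = gzip.decompress(compressed).decode('utf-8')
print(text)
```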

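The important point here is that the answers found online say to remove Accept-Encoding.

Since we know the cause is that the data was not decompressed, decompressing it should be enough, so I don't see why the header should be removed.
The correct fix is therefore to decompress the data first. Here is my decompression code.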

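# Decompress
import gzip


def ungzip(data):
    """Return data as a str, decompressing it first if it is gzip-compressed."""
    try:
        print('Decompressing ...')
        data = gzip.decompress(data)
        print('Decompression complete')
    except OSError:
        # gzip.decompress() raises OSError when data is not in gzip format
        print('Not compressed, no need to decompress')
    return data.decode('utf-8')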

The code that reads the response then looks like this:

data = op.read()
# data = op.read().decode('utf-8')
# Don't write it like the line above: the data returned by op.read() is not decompressed yet, so calling decode() on it raises the exception above.
data = ungzip(data)
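
The request header above also accepts deflate, which gzip.decompress() does not handle. The following is a hedged sketch, not the author's code: the function name decode_body and its parameters are hypothetical, and it assumes the value of the response's Content-Encoding header is passed in by the caller.

```python
import gzip
import zlib


def decode_body(data, content_encoding=None, charset='utf-8'):
    """Decompress data according to content_encoding, then decode it to str.

    content_encoding is assumed to be the value of the response's
    Content-Encoding header, or None when the header is absent.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        data = gzip.decompress(data)
    elif encoding == 'deflate':
        try:
            data = zlib.decompress(data)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            data = zlib.decompress(data, -zlib.MAX_WBITS)
    return data.decode(charset)


print(decode_body(gzip.compress(b'abc'), 'gzip'))
print(decode_body(zlib.compress(b'abc'), 'deflate'))
print(decode_body(b'abc'))
```

With the crawler in this article it might be called as decode_body(op.read(), op.headers.get('Content-Encoding')). Choosing the decompressor from the header avoids relying on an exception to detect uncompressed data.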

a bytes-like object is required, not 'list'

First, here is the full code.

# -*- coding:utf-8 -*-
import gzip
import http.cookiejar
import io
import re
import sys
import urllib.request

# Write console output as UTF-8 (alternative: gb18030)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
# Decompress


def ungzip(data):
    try:
        print('Decompressing ...')
        data = gzip.decompress(data)
        print('Decompression complete')
    except OSError:
        # gzip.decompress() raises OSError when data is not in gzip format
        print('Not compressed, no need to decompress')
    return data.decode('utf-8')

# Get the _xsrf value from the page


def getXSRF(data):
    cer = re.compile('name="_xsrf" value="(.*)"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

# Build an opener that handles cookies and sends the given request headers


def getOpener(head):
    # Handle cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    # addheaders expects a list of (name, value) tuples
    opener.addheaders = list(head.items())
    return opener

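# Save the data to a file


def saveFile(data):
    data = data.encode('utf-8')  # str -> bytes, as required by binary mode
    save_path = r'E:\temp.out'  # raw string, otherwise '\t' is read as a tab
    with open(save_path, 'wb') as f_obj:  # 'wb': open for writing in binary mode
        f_obj.write(data)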


# Request header value
header = {
    'Connection': 'Keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Accept-Encoding': 'gzip,deflate',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Host': 'www.qiushibaike.com'
}


# page = 1
url = 'http://www.qiushibaike.com/hot/'
# get request headers
# opener = getOpener(header)
# op = opener.open(url)
# data = op.read()
# data = ungzip(data) # decompress
# _xsrf = getXSRF(data.decode())


try:
    opener = getOpener(header)
    op = opener.open(url)
    data = op.read()

    data = ungzip(data)
    # op = urllib.request.urlopen(url)
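    # NOTE: the literal HTML tags that belong between the wildcards are missing
    # from the source; only the wildcard and group parts of the pattern remain.
    strRex = ('.*?(.*?).*?(.*?)' +
              '.*?(.*?)(.*?)(.*?)')
    # The flag is truncated in the source ("re."); re.S (DOTALL) is assumed here.
    pattern = re.compile(strRex, re.S)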
    print(type(data))
    items = re.findall(pattern, data)

    for item in items:
        print(item[0] + item[1] + item[2] + item[3])
    # print(item)
    print(items)
    # saveFile(''.join(str(e) for e in items)) # correct code
    saveFile(items)  # wrong: items is a list, which causes the error explained below
except Exception as e:
    print(e)

The error above is raised by this call.

saveFile(items)

In the saveFile function, the line `f_obj = open(save_path, 'wb')` uses the mode "wb", which opens the file for writing in binary mode. The items I passed in is a list, so an error is reported.

According to some information I checked, the issue in this case is the b in the mode parameter, so the file would be opened without it:

f_obj = open(save_path, 'w')

Running the code then shows that a str has to be passed in instead. That is, when the file is not opened in binary mode (b), write() expects a str.
Solution: first convert the data to be written to str, and then, in saveFile, convert the str to bytes.
We keep opening the file in binary mode and add the following line to saveFile:

data = data.encode('utf-8')
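
The difference between the two modes can be checked with a temporary file. This is only a sketch: the sample text and the temporary path are made up, and the E:\ path from the article is deliberately not used.

```python
import os
import tempfile

text = 'héllo'  # hypothetical sample text

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'temp.out')  # hypothetical file name

    # Binary mode ('wb') accepts bytes only
    with open(path, 'wb') as f_obj:
        f_obj.write(text.encode('utf-8'))
        try:
            f_obj.write(text)
        except TypeError as exc:
            binary_error = str(exc)

    with open(path, 'rb') as f_obj:
        saved = f_obj.read()

    # Text mode ('w') accepts str only
    with open(path, 'w', encoding='utf-8') as f_obj:
        f_obj.write(text)
        try:
            f_obj.write(text.encode('utf-8'))
        except TypeError as exc:
            text_error = str(exc)

print(binary_error)
print(text_error)
print(saved)
```

Either mode works as long as the type matches: bytes for 'wb', str for 'w'. Passing encoding='utf-8' in text mode avoids depending on the platform default encoding.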

The encode() method converts a str to bytes.
A list or tuple is converted to a str with the `''.join()` method.
At first I wrote:

saveFile(''.join(items))

As a result, another error was reported:

sequence item 0: expected str instance, tuple found

This means that join() expects every element of the sequence to be a str, but the first element it found was a tuple.
So we have to iterate over the list first and convert each tuple to a str.

''.join(str(e) for e in items)

Reading this expression from right to left: it iterates over items, then calls str() on each item to convert it to a str.
Each item is a tuple, so str() returns its string representation, and `''.join()` then concatenates those strings into a single str.
A tuple converted with str() prints the same as the tuple itself, but type() shows the difference.

    s = ('a', 'b', 'c')
    print(str(s))
    print(s)
    print(type(str(s)))
    print(type(s))

The printed result:

('a', 'b', 'c')
('a', 'b', 'c')
<class 'str'>
<class 'tuple'>

So the solution is to change saveFile(items) to:

saveFile(''.join(str(e) for e in items))
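
The following sketch shows the error and the fix on sample data. The items list is made up; it only assumes that re.findall() with several groups returns a list of tuples of str. The tab and newline separators in the last variant are an alternative for illustration, not what the article's code does.

```python
# Hypothetical sample of what re.findall() returns for a multi-group pattern
items = [('alice', 'first post', '12'), ('bob', 'second post', '7')]

try:
    ''.join(items)  # join() needs str elements, not tuples
except TypeError as exc:
    join_error = str(exc)

# The approach from the text: str() keeps the quotes and parentheses
as_repr = ''.join(str(e) for e in items)

# Alternative sketch: join the fields of each tuple, one item per line
as_lines = '\n'.join('\t'.join(e) for e in items)

print(join_error)
print(as_repr)
print(as_lines)
```

The first variant writes the tuples exactly as they print; the second produces plain tab-separated lines, which may be easier to read back later.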

Finally, here are the conversions between list, tuple, and str.

list() converts a str or a tuple to a list.
tuple() converts a str or a list to a tuple.

>>> s = "xxxxx"
>>> list(s)
['x', 'x', 'x', 'x', 'x']
>>> tuple(s)
('x', 'x', 'x', 'x', 'x')
>>> tuple(list(s))
('x', 'x', 'x', 'x', 'x')
>>> list(tuple(s))
['x', 'x', 'x', 'x', 'x']

Converting a list or tuple back to the original string relies on join():

>>> "".join(tuple(s))
'xxxxx'
>>> "".join(list(s))
'xxxxx'
>>> str(tuple(s))
"('x', 'x', 'x', 'x', 'x')"  # With the Sublime Text 3 plugin SublimeREPL, the outer double quotes are not displayed; the same applies above.
>>>

Reference links:

http://blog.csdn.net/sruru/article/details/7803208
http://stackoverflow.com/questions/5618878/how-to-convert-list-to-string
http://piziyin.blog.51cto.com/2391349/568426
http://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str

