Installation Tutorial
Look this up online yourself (the install is a bit of a hassle if you're a complete beginner).
Installing Third-Party Libraries
Installing via the PyCharm Terminal
```bash
# install a module from the terminal
pip install <module-name>
# uninstall an installed module
pip uninstall <module-name>
```
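For example, the scraper later in this post needs requests and pandas, plus openpyxl (the engine pandas uses when writing .xlsx files), so a typical install in the PyCharm terminal would be:

```bash
pip install requests pandas openpyxl
```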
Configuring a Global Mirror in the Terminal
This makes pip pull packages from a mirror inside China, which is much faster. PyCharm defaults to the overseas PyPI index, and its download speed can be painful.
```bash
# If you don't want a global mirror, use one for a single install instead:
pip install <module-name> -i <mirror-URL>

# Global mirror configuration
pip install --upgrade pip
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

# Commonly used mirrors in China
# Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
# USTC:     https://pypi.mirrors.ustc.edu.cn/simple/
# Aliyun:   https://mirrors.aliyun.com/pypi/simple/
# HUST:     http://pypi.hustunique.com/
```
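To confirm the global mirror actually took effect, you can print pip's current configuration (the `config` subcommand exists in pip 10 and later):

```bash
pip config list
# should include a line like: global.index-url='https://mirrors.aliyun.com/pypi/simple/'
```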
Python Web Scraping in Practice, Part 1
This code only works for endpoints that return JSON data; it does not work for HTML pages (see the content-type guard sketched after the code).
```python
import time

import pandas as pd
import requests

login_url = 'https://test.com'

username = 'your-account'
password = 'your-password'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0',
}

# Log in once with a session so the server can set its cookies.
# Note: the calls below use requests.get() directly, so they do NOT carry this
# session's cookies; if the API requires the login, reuse `session` instead.
with requests.Session() as session:
    session.headers.update(headers)
    responses = session.post(login_url, data={'email': username, 'password': password})
    if not responses.ok:
        print('Login failed')
    else:
        print('Login succeeded')


def get_page_data(url, params):
    """Fetch one page and return its parsed JSON, or None on failure."""
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    print(f"Failed to retrieve page with status code {response.status_code}")
    return None


def crawl_paginated_data(url, start_page, end_page, base_params):
    """Walk the pages from start_page to end_page and collect every video entry."""
    all_data = []
    for page in range(start_page, end_page + 1):
        page_params = base_params.copy()
        page_params['page'] = page
        time.sleep(3)  # throttle requests so we don't hammer the server
        page_data = get_page_data(url, page_params)
        if page_data:
            data_list = page_data.get('data', {}).get('videoList', [])
            all_data.extend(data_list)
    return all_data


# Base URL of the list endpoint; base_params below are its query-string parameters.
url = '<base URL of the list request>'
start_page = 1
end_page = 1
base_params = {
    'o': 't',
    'lang': 'cn',
    'u': '545efd60-70c6-41eb-86bc-b62931f02637',
    't': '0'
}
# Excel column names, for reference only; the DataFrame columns actually
# come from the keys of row_data below.
excel_columns = ['Title', 'Nickname', 'Duration', 'Link code', 'URL']
data_frames = []
count = 0

data_list = crawl_paginated_data(url, start_page, end_page, base_params)
for item in data_list:
    count += 1
    detail_url = f"https://test.com/get?lang=cn&v={item.get('i')}&u=<param-1>&p=<param-2>"
    response = requests.get(detail_url)
    if response.status_code == 200:
        item_data = response.json()
        data = item_data['data']['v']
        o_data = data['o']
        print(f"Fetched ----> [count: {count}][title: {data['t']}]")
        row_data = {
            'Title': data['t'],
            'Nickname': o_data['n'],
            'Duration': data['d'],
            'Link code': data['i'],
            'URL': 'URL format written to the Excel file'  # placeholder; the original referenced an example comment that is missing here
        }
        data_frames.append(pd.DataFrame([row_data]))
    else:
        print(f"Request failed, status code: {response.status_code}")

if data_frames:
    result_df = pd.concat(data_frames, ignore_index=True)
    excel_file = 'test1.xlsx'
    result_df.to_excel(excel_file, index=False)  # writing .xlsx needs the openpyxl engine
else:
    print("No data was retrieved.")
```
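As noted above, this script assumes every endpoint answers with JSON; calling response.json() on an HTML page typically raises a JSONDecodeError. If you are not sure what a URL returns, a small guard avoids that crash. A minimal sketch (get_json_or_none is a helper name I made up, not part of the original script):

```python
import requests

def get_json_or_none(url, **kwargs):
    """Return the parsed JSON body, or None if the server sent something else (e.g. HTML)."""
    response = requests.get(url, **kwargs)
    content_type = response.headers.get('Content-Type', '')
    if response.ok and 'application/json' in content_type:
        return response.json()
    print(f"Not JSON (status {response.status_code}, Content-Type: {content_type})")
    return None
```

You could swap this in for get_page_data above; the only change is that it checks the Content-Type header before trying to parse the body.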