Scraping JD.com TT (condom) purchase records with Selenium and analyzing the trends: a walkthrough

Webmaster Resources 2024/12/24 Anonymous


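I've recently been learning web scraping and wanted a small project to test what I had learned. While I was browsing JD.com, a TT (condom) product was recommended to me out of the blue. After clicking through and looking around, I got the idea of scraping TT products and analyzing the data to see which brand sells best.

This article uses Selenium to scrape the TT product information and stores it in a MongoDB database.

Scraping TT product information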

The URL of the TT product listing page is

https://list.jd.com/list.html

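In the page source, each TT product corresponds to an li node with class gl-item (<li class='gl-item'>). The data-sku attribute is the product ID, which is needed later to fetch the product's comments, and brand_id is the brand ID. The div with class p-price holds the product's price, and the div with class p-commit holds the total comment count.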
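To make the node structure described above concrete, here is a minimal sketch that pulls the product ID, price text and comment-count text out of a listing fragment. It uses only the standard library's html.parser rather than the pyquery calls in the author's code, so it runs without extra packages. SAMPLE_HTML is made up for illustration and only mimics the structure; the comment-count class name p-commit is taken from the selector in the author's code. ProductListParser and parse_products are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up fragment that only mimics the structure described in the text.
SAMPLE_HTML = """
<ul>
  <li class="gl-item" data-sku="100001">
    <div class="p-price"><strong><i>39.90</i></strong></div>
    <div class="p-commit"><strong><a>2.5万+</a></strong></div>
  </li>
  <li class="gl-item" data-sku="100002">
    <div class="p-price"><strong><i>59.00</i></strong></div>
    <div class="p-commit"><strong><a>800+</a></strong></div>
  </li>
</ul>
"""


class ProductListParser(HTMLParser):
    """Collect data-sku, price text and comment-count text per li.gl-item."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # 'price', 'comments', or None when outside both divs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "gl-item" in classes:
            # A new product starts: remember its ID and prepare empty fields.
            self.products.append(
                {"product_id": attrs.get("data-sku"), "price": "", "comments": ""}
            )
        elif tag == "div" and "p-price" in classes:
            self._field = "price"
        elif tag == "div" and "p-commit" in classes:
            self._field = "comments"

    def handle_endtag(self, tag):
        if tag == "div":
            self._field = None

    def handle_data(self, data):
        # Append non-blank text to whichever field we are currently inside.
        if self._field and self.products and data.strip():
            self.products[-1][self._field] += data.strip()


def parse_products(html):
    parser = ProductListParser()
    parser.feed(html)
    parser.close()
    return parser.products


if __name__ == "__main__":
    for product in parse_products(SAMPLE_HTML):
        print(product)
```

The sketch assumes the price and comment divs contain no nested div; real listing markup may need a more careful parser or the pyquery selectors the author uses.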

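At first I used requests, but it never managed to parse the price and comment information; switching to Selenium finally solved the problem. If anyone knows how to make this work with requests, I would appreciate the pointer.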
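A likely explanation, which the author does not confirm, is that the price and comment counts are filled in by JavaScript after the initial HTML arrives, so a plain HTTP client never sees them. The sketch below shows one way to get the rendered HTML with Selenium. fetch_rendered_html is a hypothetical name, the CSS selector is adapted from the author's parsing code, and the scroll step assumes the page loads items lazily.

```python
def fetch_rendered_html(url, timeout=10):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    # Imported inside the function so this file can be imported without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        # Scroll to the bottom so lazily loaded items are requested too
        # (an assumption about how the listing page behaves).
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait until at least one product price node is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".gl-item .p-price i")
            )
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can then be handed to the same parsing code that would have received the requests response body.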

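Next, scraping the TT product comments.

Clicking a TT product opens its detail page. Click "商品评论" (product comments) and tick the "只看当前商品评价" (only show reviews for this product) option; without it you see reviews for the whole product series. The product's comments are then displayed. Let's use the developer tools to see how to scrape them.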

In the developer tools, clicking the Network tab reveals

https://club.jd.com/discussion/getSkuProductPageImageCommentList.action

import time

from pyquery import PyQuery as pq

# get_random_time(), collection and collection_comment are used below but are not
# defined in this excerpt; they come from elsewhere in the author's script.

def parse_product(page, html):
  doc = pq(html)
  li_list = doc('.gl-item').items()
  for li in li_list:
    product_id = li('.gl-i-wrap').attr('data-sku')
    brand_id = li('.gl-i-wrap').attr('brand_id')
    time.sleep(get_random_time())  # pause between products
    title = li('.p-name').find('em').text()
    # Take the first price shown for the product; keep 0 if none is found.
    price_items = li('.p-price').find('.J_price').find('i').items()
    price = 0
    for price_item in price_items:
      price = price_item.text()
      break
    # Normalize the displayed comment count. "万" means ten thousand,
    # so "2.5万+" becomes "25000" and "800+" becomes "800".
    total_comment_num = li('.p-commit').find('strong a').text()
    if total_comment_num.endswith("万+"):
      print('Total comment count: ' + total_comment_num)
      total_comment_num = str(int(float(total_comment_num[:-2]) * 10000))
      print('Converted total comment count: ' + total_comment_num)
    elif total_comment_num.endswith("+"):
      total_comment_num = total_comment_num[:-1]
    condom = {}
    condom["product_id"] = product_id
    condom["brand_id"] = brand_id
    condom["condom_name"] = title
    condom["total_comment_num"] = total_comment_num
    condom["price"] = price
    # Base endpoint only; the query string is truncated in the source.
    comment_url = 'https://club.jd.com/comment/skuProductPageComments.action'
    # [Truncated in the source: the request to comment_url, the parsing of the
    #  response into `jsons`, and the extraction of the summary counts stored
    #  below are missing.]
    condom["poor_count"] = poor_count
    condom["general_count"] = general_count
    condom["good_count"] = good_count
    condom["comment_count"] = comment_count
    condom["poor_rate"] = poor_rate
    condom["good_rate"] = good_rate
    condom["general_rate"] = general_rate
    condom["default_good_count"] = default_good_count
    collection.insert_one(condom)  # Collection.insert() was removed in PyMongo 4
    # Store the comments returned with the first response.
    comments = jsons.get('comments')
    if comments:
      for comment in comments:
        print('Parsing comment')
        condom_comment = {
          "reference_time": comment.get('referenceTime'),
          "content": comment.get('content'),
          "product_color": comment.get('productColor'),
          "user_client_show": comment.get('userClientShow'),
          "user_level_name": comment.get('userLevelName'),
          "is_mobile": comment.get('isMobile'),
          "creation_time": comment.get('creationTime'),
          "guid": comment.get("guid"),
        }
        collection_comment.insert_one(condom_comment)
    # Fetch the remaining comment pages for this product.
    parse_comment(product_id)
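def parse_comment(product_id):
  # Base endpoint only; the query string is truncated in the source.
  comment_url = 'https://club.jd.com/comment/skuProductPageComments.action'
  # [Truncated in the source from the URL above to the "guid" line. The loop header
  #  and the field extraction are reconstructed from parse_product and from the
  #  surviving `else: break`; the request that fills `jsons` per page is missing.]
  while True:  # assumed: one iteration per comment page
    comments = jsons.get('comments')  # `jsons`: parsed response for the current page
    if comments:
      for comment in comments:
        condom_comment = {
          "reference_time": comment.get('referenceTime'),
          "content": comment.get('content'),
          "product_color": comment.get('productColor'),
          "user_client_show": comment.get('userClientShow'),
          "user_level_name": comment.get('userLevelName'),
          "is_mobile": comment.get('isMobile'),
          "creation_time": comment.get('creationTime'),
          "guid": comment.get("guid"),
          "id": comment.get("id"),
        }
        collection_comment.insert_one(condom_comment)
    else:
      # An empty page means there are no more comments for this product.
      break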
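The comment-count conversion inside parse_product can be pulled out into a small function that is easy to check on its own. This is a sketch under assumptions: normalize_comment_count is a hypothetical name, it returns an int rather than the string the original code stores, and it uses round() instead of int() to guard against floating-point products that land just below the intended whole number.

```python
def normalize_comment_count(text):
    """Convert a displayed count such as '2.5万+' or '800+' to an int."""
    text = text.strip()
    # Drop the trailing '+' that marks an approximate count.
    if text.endswith("+"):
        text = text[:-1]
    if text.endswith("万"):
        # 万 means ten thousand, so '2.5万' is 25000.
        return round(float(text[:-1]) * 10000)
    # An empty string means no count was displayed; treat it as zero.
    return int(text) if text else 0


if __name__ == "__main__":
    for sample in ["2.5万+", "10万+", "800+", "1234", ""]:
        print(repr(sample), "->", normalize_comment_count(sample))
```

Storing the count as a number also makes later sorting straightforward, since the values no longer compare as text.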

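If you would like the code for scraping the TT data and comments, follow my WeChat official account "python_ai_bigdata" and reply "TT" to get it.

In total I scraped 8,934 product records and 170,000 comment (purchase) records.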

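Brands with the most products

First, let's analyze the 8,934 products to see which brand has the most TT listings on JD.com. There are too many brands to show them all (299 brands sell TT on JD.com), so we take only the top 10 by number of products.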

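The chart shows that Durex ranks first, Okamoto second and Jissbon third. The top 10 brands are 杜蕾斯 (Durex), 冈本 (Okamoto), 杰士邦 (Jissbon), 倍力乐, 名流, 第六感, 尚牌, 赤尾, 诺丝 and 米奥. Five of the ten are brands I had never come across: 倍力乐, 名流, 尚牌, 赤尾 and 米奥. I have seen all the others, and Durex and Jissbon in particular permanently occupy the eye-catching spots at supermarket checkouts.

Of these ten brands, Durex is from the UK and Okamoto is from Japan. Jissbon, 第六感, 赤尾, 米奥 and 名流 are Chinese brands, and 第六感 is a condom brand owned by Jissbon. 倍力乐 is a Sino-American joint-venture brand, 尚牌 is from Thailand, and 诺丝 is an American brand.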

Code:

import pymongo
import pandas as pd
import matplotlib.pyplot as plt

# Load the scraped products from MongoDB into a DataFrame.
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.condomdb
condom_new = db.condom_new
cursor = condom_new.find()
condom_df = pd.DataFrame(list(cursor))

# Count products per brand and keep the 10 brands with the most products.
brand_name_df = condom_df['brand_name'].to_frame()
brand_name_df['condom_num'] = 1
brand_name_group = brand_name_df.groupby('brand_name').sum()
brand_name_sort = brand_name_group.sort_values(by='condom_num', ascending=False)
brand_name_top10 = brand_name_sort.head(10)

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# Draw the top 10 as a pie chart; the brand names in the index become the labels.
# The empty series name keeps the y-axis label blank.
series_condom = brand_name_top10['condom_num'].rename('')
series_condom.plot.pie(autopct='%.2f', fontsize=10, figsize=(10, 10))
plt.show()

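Best-selling products

The number of comments on a product is a reasonable proxy for how well it sells: the products with the most comments are usually also the best sellers.

Each product record has a total-comment-count field, so we sort by that field to find the 10 products with the most comments (that is, the best sellers).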
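One detail matters for this sort: the scraping code stores total_comment_num as a string, and strings compare character by character, so "800" would rank above "25000". The sketch below converts the field to an integer before ranking. The records are made up for illustration, and comment_count and top_products are hypothetical names.

```python
import heapq

# Made-up records in the shape the scraper stores them (counts are strings).
products = [
    {"product_id": "100001", "condom_name": "A", "total_comment_num": "25000"},
    {"product_id": "100002", "condom_name": "B", "total_comment_num": "800"},
    {"product_id": "100003", "condom_name": "C", "total_comment_num": "120000"},
    {"product_id": "100004", "condom_name": "D", "total_comment_num": ""},
]


def comment_count(product):
    """Return the stored comment count as an int, treating a blank as 0."""
    value = product.get("total_comment_num") or "0"
    return int(value)


def top_products(records, n=10):
    """Return the n records with the highest numeric comment count."""
    return heapq.nlargest(n, records, key=comment_count)


if __name__ == "__main__":
    for product in top_products(products, 3):
        print(product["condom_name"], comment_count(product))
```

With the data in a pandas DataFrame, the same idea applies: convert the column to a numeric type first, then sort in descending order and take the first 10 rows.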