Bilibili Crawler: A User-Centered Approach

Overview

Open Bilibili Crawer

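Bilibili is a social platform rich in information, and we build a social network on top of it. In this network, the nodes are users (uploaders, "up主") and creative works such as videos and articles (专栏). Relations come in two kinds. Between users: following relations (following/follower), reply relations (in comment sections), and repost relations (reposting a video or a feed post). Between a user and a work: comment relations (with the comment text), danmaku relations (with the danmaku text), likes, coins, and so on. Relations between works can also be constructed by hand, for example when two works belong to the same category partition or share more than 50% of their tags.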
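As a minimal sketch of the shared-tag rule mentioned above: the function names `tag_overlap` and `share_enough_tags` are hypothetical, and measuring the overlap against the smaller of the two tag sets is an assumption, since the text does not say which denominator the 50% refers to.

```python
def tag_overlap(tags_a, tags_b):
    """Fraction of the smaller tag set that also appears in the other set."""
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def share_enough_tags(tags_a, tags_b, threshold=0.5):
    """True when the overlap is strictly greater than the threshold (50% by default)."""
    return tag_overlap(tags_a, tags_b) > threshold
```

Two works for which `share_enough_tags` returns `True` could then be linked by a work-to-work edge.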

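In short, the Bilibili network is an information-rich heterogeneous network.

We take one person as the center and crawl their profile and the information about their works. Crawling a specified group of people this way yields an information-rich heterogeneous network. The input to OBC (Open Bilibili Crawer) is therefore the id (mid) of a single user or of a group of users.

OBC currently wraps three crawlers: ==the relation crawler, the profile crawler, and the video crawler==. The relation crawler follows the default following relation to collect relations between users and build the base user network. The profile crawler collects as much valuable information about each user as it can, which provides the user node attributes. The video crawler collects text and statistics for videos, which can further enrich user node attributes or be used to build heterogeneous video nodes.

The heterogeneous network built by OBC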

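1. Relation crawler

Bilibili only lets you browse at most 100 entries of another user's following or follower list, so the relation crawler collects at most 100 of the accounts each user follows. For high-profile users (大V), the number of accounts they follow is usually far smaller than the number of their followers, so sampling the following list keeps the bias in the network structure as small as possible. The crawler returns a list of triples with these fields:

| Field | Meaning | Notes |
|---|---|---|
| from_nd | mid of the center user | |
| to_nd | mid of a user the center user follows | |
| rel_type | Relation type | Defaults to "following" |

Each triple means: from_nd follows to_nd.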
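The triple format can be sketched as follows. This is only an illustration under stated assumptions: `build_following_triples` is a hypothetical helper, the list of followed mids is assumed to have been fetched already, and representing each triple as a dict is a choice made here, not necessarily what OBC does.

```python
MAX_FOLLOWINGS = 100  # the browsing cap described in the text


def build_following_triples(center_mid, following_mids, rel_type="following"):
    """Build one triple per followed user, keeping at most the first 100."""
    return [
        {"from_nd": center_mid, "to_nd": to_mid, "rel_type": rel_type}
        for to_mid in following_mids[:MAX_FOLLOWINGS]
    ]
```

Concatenating the lists produced for every input mid gives the edge list of the base user network.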

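2. Profile crawler

Profile information is spread across several API endpoints. The fields we collect are:

| Field | Meaning | Notes |
|---|---|---|
| nfollowing | Number of accounts the user follows | |
| nfollower | Number of followers | |
| uname | Username | |
| sex | Gender | 男 (male), 女 (female), or 保密 (undisclosed) |
| sign | Profile bio | |
| level | Bilibili level | Integer from 0 to 6 |
| official | Official titles | String delimited by "、"; each element after splitting is one title |
| birthday | Birthday | MM-DD format, e.g. 01-01 |
| school | School | |
| profession | Profession | |
| video_view | Total video views | |
| article_view | Total article reads | |
| nlike | Total likes | |

Some metrics, such as the number of uploaded videos, are not integrated yet, but the fields above should be enough to serve as node attributes.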
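Two of these fields need light parsing before they can be used as node attributes. The sketch below shows one way to do it; `split_official` and `parse_birthday` are hypothetical helper names, and treating an empty or malformed birthday as `None` is an assumption.

```python
def split_official(official):
    """Split the 、-delimited `official` string into a list of titles."""
    return [title for title in official.split("、") if title]


def parse_birthday(birthday):
    """Parse an MM-DD string into (month, day); return None if it is malformed."""
    parts = birthday.split("-")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return int(parts[0]), int(parts[1])
```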

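3. Video crawler

The crawler first finds up to 50 of the user's most recently uploaded videos, then collects text information and statistics for each one. The number of videos can be increased.

The text information consists of the video's category (typeid) and its tags (tag, tid). The crawler also stores information about every tag it has encountered: the tag title, the tid, the number of users subscribed to the tag, the number of users who have used it, and so on. Associating typeids with tags is also enough to roughly infer which category partition a typeid stands for. The statistics include the view count and a series of other numeric values. The API additionally provides further metrics such as video duration and publish time, which are not integrated.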

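Currently, the video information for each user consists of:

    "mid" : 359797,      // user mid
    "video_type" : {     // number of the user's videos per category; categories are identified by typeid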
        "138" : 44,
        "21" : 4,
        "240" : 1,
        "28" : 1
    },
    "video_tag" : {     // tag frequencies in descending order, at most 50 kept; tags are identified by tag id (tid), and the tag name can be looked up in the global tag data
        "1711163" : 38,
        "1833" : 17,
        "7662089" : 11,
        "6497596" : 10,
        "13926" : 9,
        ...
        "19327" : 1,
        "34356" : 1
    },
    "video_stat" : [   // list; each element holds the statistics of one video, 8 metrics in total
        {
            "aid" : 848235319,
            "bvid" : "BV1ZL4y1872w",
            "view" : 84679,
            "danmaku" : 107,  // number of danmaku
            "reply" : 273,
            "favorite" : 299,
            "coin" : 302,
            "share" : 309,
            "like" : 5082,
            "his_rank" : 0  // apart from 0, smaller is better
        }, 
        {
            "aid" : 848011693,
            "bvid" : "BV1aL4y1a77s",
            "view" : 3128993,
            "danmaku" : 1767,
            "reply" : 4511,
            "favorite" : 33329,
            "coin" : 21221,
            "share" : 46047,
            "like" : 177489,
            "his_rank" : 34
        }, 
        ...
        ]

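The video information can be used as follows:

  1. Take each user's top-K tags, encode them as a vector, and use it as one of the user node's attributes.
  2. Take the category (typeid) that covers the most of a user's videos as the user node's label, and treat this as a K-class classification problem.
  3. For the video statistics, take the sum/mean/max of each metric as one dimension of the user node's attributes.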
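The three uses above can be sketched on the per-user record format shown earlier. The helper names are hypothetical, the vector encoding of tags is left out, and excluding `his_rank` from the aggregates (because 0 has a special meaning) is an assumption.

```python
METRICS = ("view", "danmaku", "reply", "favorite", "coin", "share", "like")


def top_k_tags(video_tag, k):
    """Return the tids of the k most frequent tags."""
    ranked = sorted(video_tag.items(), key=lambda item: item[1], reverse=True)
    return [tid for tid, _ in ranked[:k]]


def majority_typeid(video_type):
    """Return the typeid that covers the most videos, to use as the node label."""
    return max(video_type, key=video_type.get)


def aggregate_stats(video_stat, metrics=METRICS):
    """Compute sum/mean/max of each metric over a user's videos."""
    features = {}
    for metric in metrics:
        values = [video[metric] for video in video_stat]
        features[metric + "_sum"] = sum(values)
        features[metric + "_mean"] = sum(values) / len(values) if values else 0.0
        features[metric + "_max"] = max(values) if values else 0
    return features
```

For the sample record above, `majority_typeid` returns `"138"`, the typeid that covers 44 of the videos.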

The information stored for each tag looks like this:

{
    "tid" : 1767558,
    "tag_name" : "VLOG日常",
    "subscribe" : 5225,  // number of subscribers
    "use" : 1447949,     // number of uses
    "feature" : 0
}
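Since `video_tag` stores only tids, resolving tag names needs a lookup into the global tag data. A minimal sketch, assuming the tag records are available as a list of dicts shaped like the one above (`build_tag_index` and `name_tags` are hypothetical names). Note that tids appear as strings in `video_tag` keys but as integers in the tag records, so the index is keyed by string.

```python
def build_tag_index(tag_records):
    """Map each tid (as a string) to its tag name."""
    return {str(record["tid"]): record["tag_name"] for record in tag_records}


def name_tags(video_tag, tag_index):
    """Replace tids with tag names where known; unknown tids are kept as-is."""
    return {tag_index.get(tid, tid): count for tid, count in video_tag.items()}
```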

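4. Future Work

OBC assumes a set of users has already been selected and builds an information-rich network structure over them. How to select those users is outside OBC's scope.

Future work includes:

  1. Building heterogeneous nodes and edges: video nodes can already be built, but the only relation between users and videos is "published the video". There is not yet a way to add other user-video relations such as likes and comments.
  2. Which downstream tasks can this network serve? This is a question for us and for everyone who comes across this project to think about together.
  3. Improving OBC's performance: adding proxies, multithreading, and so on.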
Owner
Boshen Shi