🔥 Reddit Scraper with Bittensor: A POC for Incentive-Driven Data Acquisition

Overview

In this blog post, we introduce a Proof of Concept (POC) that combines the functionality of Reddit scraping with Bittensor, a blockchain network designed to incentivize the creation of digital commodities. This project aims to showcase the integration of Bittensor's self-contained incentive mechanisms, known as subnets, with a Reddit scraping service.

The primary goal of this project is to demonstrate how Bittensor subnets can be utilized to incentivize the creation of value within the context of a Reddit scraping service. The functionality of this POC is limited to searching and retrieving metadata from subreddit posts and some top-level comments for each post.

What is Bittensor?

In a nutshell, Bittensor is a blockchain platform that hosts multiple self-contained incentive mechanisms known as subnets. These subnets act as arenas where subnet miners generate value and subnet validators establish consensus. This collaboration determines the fair distribution of TAO tokens, incentivizing the creation of digital commodities like intelligence or data within each subnet.

Each subnet comprises subnet miners and subnet validators, interacting with each other through a specific protocol that forms part of the incentive mechanism. The Bittensor API facilitates this interaction between subnet miners, subnet validators, and Bittensor's on-chain consensus engine called Yuma Consensus. The Yuma Consensus is specifically designed to foster agreement among subnet validators and subnet miners regarding value creation and its corresponding worth within the ecosystem.

Prerequisites

Python >=3.10: An older Python version can work, but it requires some modifications, because we use match/case statements, which were introduced in Python 3.10.
Linux environment: At the time of writing, the prerequisite instructions for running Bittensor are only available for Linux. They may work in other environments, but that is not guaranteed. If you don’t have a Linux installation, you can use a VM; WSL2 on Windows is also a great alternative.
Bittensor: We need to install the bittensor package for Python. For this POC I’m using version 6.8.2.
Install Substrate Dependencies: Begin by installing the required dependencies for running a Substrate node:

sudo apt update

sudo apt install --assume-yes make build-essential git clang curl libssl-dev llvm libudev-dev protobuf-compiler

Install Rust and Cargo: Rust is the programming language used in Substrate development. Cargo is Rust's package manager.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

Clone the Subtensor Repository: Fetch the subtensor codebase to your local machine. We will use it to run our local test environment. (These steps only need to be done once; after that, we just run the last command to start Subtensor.)

git clone https://github.com/opentensor/subtensor.git
./subtensor/scripts/init.sh
cd subtensor
cargo build --release --features pow-faucet
BUILD_BINARY=0 ./scripts/localnet.sh

Reddit’s API envs: We will also need an API client ID and a secret. We can get these by registering on Reddit and then creating an app here: https://reddit.com/prefs/apps/ I suggest moving them to an .env file and using a virtual environment tool such as pipenv to load them dynamically. Otherwise, you can load them manually with these commands:

export CLIENT_ID=<your_client_id>
export CLIENT_SECRET=<your_client_secret>
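
Because the scraper reads these two variables at start-up, it can help to fail early with a clear message when one of them is missing. The following is a minimal sketch, not part of the original POC; the helper name load_reddit_credentials is hypothetical.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return (client_id, client_secret), or raise if either one is missing.

    Hypothetical helper: the POC itself reads the variables with os.getenv.
    """
    missing = [name for name in ("CLIENT_ID", "CLIENT_SECRET") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["CLIENT_ID"], env["CLIENT_SECRET"]
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.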

Set up wallets and faucet tokens: Finally, we need to set up a wallet and get faucet tokens. Luckily, Bittensor offers a guide on how to do this, which also covers registering our subnet in a local environment so we can test it.

Implementation

Now we get into the juicy part. For this POC we are going to use the official subnet template. Among many other things, this template contains 3 very important files that we can modify to create our own subnet and incentive mechanism. These files are:

template/protocol.py: Contains the definition of the protocol used by subnet miners and subnet validators.
neurons/miner.py: Script that defines the subnet miner's behavior, i.e., how the subnet miner responds to requests from subnet validators.
neurons/validator.py: This script defines the subnet validator's behavior, i.e., how the subnet validator requests information from the subnet miners and determines the scores.

All the code for this POC will be available on GitHub.

For our template/protocol.py we will create a simple class that represents our input parameters and, once the request is fulfilled, also contains the output. This object is called a Synapse:

from typing import List, Literal, Optional

import bittensor as bt


class RedditProtocol(bt.Synapse):
    # Required request input, filled by sending dendrite caller.
    subreddit: str
    # Optional request input, filled by sending dendrite caller.
    sort_by: Optional[Literal['hot', 'new', 'rising', 'random_rising']] = 'new'
    limit: Optional[int] = 10

    # Optional request output, filled by receiving axon.
    output: Optional[List[dict]] = None

For this example we will keep it simple. We take 1 required parameter, the subreddit we are going to scrape, and two optional parameters: the sorting field and the number of posts we want back from the scraping.
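
To see the request/response shape without installing bittensor, here is a plain-Python stand-in for the protocol. It is only a sketch: the class name RedditRequest and its validation rules are assumptions made for illustration, while the real class inherits from bt.Synapse as shown above.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional, get_args

SortBy = Literal["hot", "new", "rising", "random_rising"]


@dataclass
class RedditRequest:
    """Hypothetical stand-in for RedditProtocol (no bittensor required)."""

    subreddit: str                       # required input, set by the caller
    sort_by: str = "new"                 # optional input, set by the caller
    limit: int = 10                      # optional input, set by the caller
    output: Optional[List[dict]] = None  # filled in by the miner

    def __post_init__(self):
        # Reject values outside the sort options the protocol declares.
        if self.sort_by not in get_args(SortBy):
            raise ValueError(f"unsupported sort_by: {self.sort_by!r}")
        if self.limit < 1:
            raise ValueError("limit must be at least 1")


request = RedditRequest(subreddit="python")   # what the client sends
request.output = [{"author": "Anonymous"}]    # what the miner fills in
```

The same object travels in both directions: the caller fills the inputs, and the miner fills output before returning it.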

In our neurons/miner.py we find the main logic for handling requests. Mining in the context of Bittensor means producing the digital commodity that the subnet rewards, in our case scraped Reddit data, and earning TAO in return. Unlike traditional mining in Proof-of-Work (PoW) blockchains like Bitcoin, where miners solve complex mathematical puzzles, a Bittensor subnet miner answers requests from subnet validators, who score the responses as part of the subnet's incentive mechanism. For now, we will only focus on the forward method:

import template
from reddit_data import process_reddit
# ...
class Miner(BaseMinerNeuron):
    # ...
    async def forward(self, synapse: template.protocol.RedditProtocol) -> template.protocol.RedditProtocol:
        # Scrape the requested subreddit and attach the posts to the synapse.
        posts = process_reddit(synapse.subreddit, synapse.sort_by, synapse.limit)
        synapse.output = posts
        return synapse

Here we take the synapse as an argument, read its parameters, and pass them to a function that handles the scraping. The output is then stored on the synapse, which we return. The process_reddit function goes as follows:

import os
from datetime import datetime

import bittensor as bt
import praw
import prawcore
from praw.models import Comment
# ...
reddit_client = praw.Reddit(
    client_id=os.getenv('CLIENT_ID'),
    client_secret=os.getenv('CLIENT_SECRET'),
    user_agent="Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0",
)
# ...
def process_reddit(subreddit_name: str, sort_by: str = "new", limit: int = 10):
    start_time = datetime.now()
    # Default to an empty list so there is always something to return.
    result = []

    try:
        subreddit = reddit_client.subreddit(subreddit_name)
        match sort_by:
            case 'hot':
                result = [submission_to_dict(submission) for submission in subreddit.hot(limit=limit)]
            case 'new':
                result = [submission_to_dict(submission) for submission in subreddit.new(limit=limit)]
            case 'rising':
                result = [submission_to_dict(submission) for submission in subreddit.rising(limit=limit)]
            case 'random_rising':
                result = [submission_to_dict(submission) for submission in subreddit.random_rising(limit=limit)]
            case _:
                bt.logging.error(f"Unsupported sort_by value: {sort_by}")
                return result

        bt.logging.success(
            f'Process finished. Elapsed {(datetime.now() - start_time)}.'
        )
    except prawcore.exceptions.NotFound:
        # NotFound is defined in prawcore.exceptions, not praw.exceptions.
        bt.logging.error(f"Subreddit not found: {subreddit_name}")

    return result

There is a LOT to unpack here, so bear with me:

First we initialize our API wrapper, praw, with our Reddit credentials.
After that we declare a function, which starts a timer that we later use for logging purposes.
We then use a match/case statement to determine which method to use for scraping.
We store the result using a list comprehension.
Additionally, we use a function called submission_to_dict, which serializes the API response into a Python dictionary.

def submission_to_dict(submission) -> dict:
    return {
        "author": submission.author.name if submission.author else "Anonymous",
        "author_flair_text": submission.author_flair_text,
        "clicked": submission.clicked,
    }
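
The prerequisites note that Python versions below 3.10 need some modifications because of match/case. One possible modification, sketched here under that assumption, is to look the listing method up by name with getattr. The names fetch_listing, ALLOWED_SORTS and FakeSubreddit are hypothetical; FakeSubreddit only exists so the sketch runs without Reddit credentials.

```python
ALLOWED_SORTS = ("hot", "new", "rising", "random_rising")


def fetch_listing(subreddit, sort_by="new", limit=10):
    """Call the listing method named by sort_by on a PRAW-style subreddit."""
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort_by: {sort_by!r}")
    # Equivalent to the match/case branches: subreddit.hot, subreddit.new, ...
    listing_method = getattr(subreddit, sort_by)
    return list(listing_method(limit=limit))


class FakeSubreddit:
    """Hypothetical offline stand-in for a praw subreddit object."""

    def new(self, limit):
        return (f"post-{i}" for i in range(limit))


posts = fetch_listing(FakeSubreddit(), "new", 3)
```

Checking sort_by against an explicit allow-list keeps getattr from reaching attributes that are not listing methods.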

We are almost ready. All we need now is a client to start making requests to our subnet miners, so let's create a template/client.py:

import bittensor as bt

from template.protocol import RedditProtocol


async def query_synapse(subreddit, category, limit, uid, wallet_name, hotkey, network, netuid):
    syn = RedditProtocol(
        subreddit=subreddit,
        sort_by=category,
        limit=limit,
    )

    # create a wallet instance with provided wallet name and hotkey
    wallet = bt.wallet(name=wallet_name, hotkey=hotkey)

    # instantiate the metagraph with provided network and netuid
    metagraph = bt.metagraph(
        netuid=netuid, network=network, sync=True, lite=False
    )

    # Grab the axon you're serving
    axon = metagraph.axons[uid]

    # Create a Dendrite instance to handle client-side communication.
    dendrite = bt.dendrite(wallet=wallet)

    # Send the synapse to the miner's axon and wait for the response.
    responses = await dendrite(
        [axon], syn, deserialize=False, streaming=False, timeout=30
    )

    print(responses)

This is less intimidating than it looks. We first instantiate our synapse with the necessary arguments so that the miner can later fulfill our request. The rest is boilerplate that sets up the objects we need (the wallet, metagraph and axon) and creates a dendrite.
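
To call the client from a terminal, its parameters can be collected with argparse. This is a sketch only: the flag names mirror the query_synapse parameters, and every default value below (wallet name, hotkey, network, netuid) is an assumption that you should replace with the values from your own local setup.

```python
import argparse


def build_parser():
    """Build a CLI parser whose argument names match query_synapse's parameters."""
    parser = argparse.ArgumentParser(description="Query a Reddit subnet miner")
    parser.add_argument("--subreddit", required=True)
    parser.add_argument("--category", default="new",
                        choices=["hot", "new", "rising", "random_rising"])
    parser.add_argument("--limit", type=int, default=10)
    parser.add_argument("--uid", type=int, required=True)
    # The defaults below are placeholders, not values taken from the POC.
    parser.add_argument("--wallet-name", default="default")
    parser.add_argument("--hotkey", default="default")
    parser.add_argument("--network", default="local")
    parser.add_argument("--netuid", type=int, default=1)
    return parser


args = build_parser().parse_args(["--subreddit", "python", "--uid", "0"])
# With bittensor installed and query_synapse imported, the call would be:
# asyncio.run(query_synapse(**vars(args)))
```

Since argparse turns --wallet-name into wallet_name, vars(args) can be unpacked directly into the coroutine's keyword arguments.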

Before we continue, what are dendrites and axons? The Bittensor documentation explains: “Axon is a server instance. Hence a subnet validator will instantiate a dendrite client on itself to transmit information to axons that are on the subnet miners.” You may think of a dendrite as a distribution center that sends our synapse objects to axons. Each miner instantiates its own axon in order to receive and process our synapse.

The metagraph is a very useful object that contains metadata about the subnet, such as registered miners, validators and axons. You can inspect a metagraph instance without participating in a subnet, which is useful for getting information about the mainnet or testnet.
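
As a small illustration of inspecting a metagraph, the sketch below lists each axon with its UID, using the same convention as the client, where metagraph.axons[uid] selects an axon. The function name summarize_axons is hypothetical, and fake_metagraph is a stand-in so the sketch runs without a chain connection; it assumes each axon entry exposes hotkey, ip and port attributes.

```python
from types import SimpleNamespace


def summarize_axons(metagraph):
    """Return one dict per registered axon, using its list index as the UID."""
    return [
        {"uid": uid, "hotkey": axon.hotkey, "endpoint": f"{axon.ip}:{axon.port}"}
        for uid, axon in enumerate(metagraph.axons)
    ]


# Hypothetical stand-in; a real object would come from
# bt.metagraph(netuid=..., network=...).
fake_metagraph = SimpleNamespace(
    axons=[SimpleNamespace(hotkey="example-hotkey", ip="127.0.0.1", port=8091)]
)
summary = summarize_axons(fake_metagraph)
```

A listing like this is a quick way to find the uid to pass to the client.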

Conclusion

This blog post introduced a Proof of Concept (POC) project that combines Reddit scraping functionality with Bittensor subnets. While this POC is limited in functionality, it serves as a foundation for future developments and expansions in integrating Bittensor with various other services and applications.

About the Author:

Daniel Rodriguez, Software Engineer at Azumo, specializes in backend systems, automation, and cloud services, with expertise in AWS, SQL, and test-driven development.
