Read streams of Twitter data, save them to Kafka, then process them with the Kafka Streams API and Spark Streaming

Last update: Dec 06, 2021

Overview

Using Streaming Twitter Data with Kafka and Spark

This project reads a stream of Twitter data, publishes it to a Kafka topic, and processes the messages using the Kafka Streams API and Spark Streaming.

Twitter is blocked in some countries. If that applies to you, make sure your VPN is switched on so that you can reach Twitter.

You also need your own `consumer_key`, `consumer_secret`, and `access_token` with its secret, stored in a `config.py` file.
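
The repository's actual `config.py` is not shown here, so the following is only a minimal sketch of what such a file could look like. The name `access_token_secret`, the environment variable names, and the `missing_credentials` helper are assumptions made for illustration. Reading the values from environment variables keeps the secrets out of version control.

```python
# config.py: hypothetical layout. The first three names follow the text above;
# access_token_secret and the environment variable names are assumptions.
import os

consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "")
access_token_secret = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", "")


def missing_credentials():
    """Return the names of the credentials that are still empty."""
    values = {
        "consumer_key": consumer_key,
        "consumer_secret": consumer_secret,
        "access_token": access_token,
        "access_token_secret": access_token_secret,
    }
    return [name for name, value in values.items() if not value]
```

Calling `missing_credentials()` before starting the stream gives a quick check that all four values are set.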

Create an environment using conda with Python 3.8:
- `conda create -n python38 python=3.8`
- `conda activate python38`
- Check the requirements in `requirements.txt` and install them using conda:
  - `conda install -c conda-forge tweepy==4.4.0`
  - `conda install -c conda-forge kafka-python==2.0.2`
Kafka should be installed on your machine; check the Kafka documentation for installation instructions. If you use Homebrew on a Mac, you can run `brew install kafka`.
- Start ZooKeeper (port 2181): `zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties`
- In another terminal window, start the broker (port 9092): `kafka-server-start /usr/local/etc/kafka/server.properties`
- In a terminal window, list the topics you have: `kafka-topics --list --bootstrap-server localhost:9092`
- Create the Kafka topic "tweeter" with 1 partition and a replication factor of 1 (no replication, because we use a local machine): `kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1`
- List the topics again to confirm that "tweeter" is there: `kafka-topics --list --bootstrap-server localhost:9092`
- Look inside the "tweeter" topic: `kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning`. There is nothing in it yet, but data will appear once streaming starts.
- Run `python kafka_producer.py` to start streaming from Twitter and pushing messages to the topic.
- Check that the data is in the topic: `kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning`

Congrats! You have done it!
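
The contents of `kafka_producer.py` are not reproduced above, so here is a hedged sketch of how such a producer could be written with tweepy 4.4.0 and kafka-python 2.0.2. It is not necessarily how the repository does it. The helper `tweet_to_bytes`, the class name `KafkaStream`, the choice of fields to keep, the `track` keyword, and the `config.access_token_secret` attribute are all assumptions for illustration.

```python
import json


def tweet_to_bytes(raw_data):
    """Turn one raw stream payload into compact UTF-8 JSON bytes.

    Returns None for payloads without a "text" field (for example limit
    notices). The choice of fields to keep is an assumption.
    """
    if isinstance(raw_data, bytes):
        raw_data = raw_data.decode("utf-8")
    tweet = json.loads(raw_data)
    if "text" not in tweet:
        return None
    slim = {
        "id": tweet.get("id"),
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
    }
    return json.dumps(slim, ensure_ascii=False).encode("utf-8")


def run(track=("python",), topic="tweeter", bootstrap_servers="localhost:9092"):
    """Stream tweets matching `track` and publish them to the Kafka topic."""
    # Third-party imports stay inside run() so that tweet_to_bytes can be
    # used and tested without tweepy or kafka-python installed.
    import tweepy
    from kafka import KafkaProducer

    import config  # assumed to define the four credentials

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

    class KafkaStream(tweepy.Stream):
        def on_data(self, raw_data):
            payload = tweet_to_bytes(raw_data)
            if payload is not None:
                producer.send(topic, payload)

        def on_request_error(self, status_code):
            print(f"Stream request failed with HTTP status {status_code}")

    stream = KafkaStream(
        config.consumer_key,
        config.consumer_secret,
        config.access_token,
        config.access_token_secret,
    )
    try:
        stream.filter(track=list(track))  # blocks until interrupted
    finally:
        producer.flush()
        producer.close()
```

Calling `run()` with ZooKeeper and the broker running should make messages show up in the console consumer. Sending compact JSON rather than the full payload keeps the topic easy to read during practice, at the cost of discarding most tweet fields.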

So what's next?

You can use the generated data with Kafka Streams and Spark Streaming, and practice more!
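
As a first practice step before moving to Kafka Streams or Spark, the topic can also be read back in plain Python. The sketch below uses kafka-python's `KafkaConsumer` and counts hashtags. It is neither Kafka Streams nor Spark Streaming, only a small stand-in for the kind of aggregation those tools would perform. It assumes that each message is a JSON object with a `text` field; `count_hashtags`, `run`, and the `limit` parameter are hypothetical names.

```python
import json
import re
from collections import Counter
from itertools import islice

HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(messages):
    """Count lower-cased hashtags across an iterable of JSON-encoded tweets."""
    counts = Counter()
    for raw in messages:
        text = json.loads(raw).get("text") or ""
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def run(topic="tweeter", bootstrap_servers="localhost:9092", limit=100):
    """Read up to `limit` messages from the topic and count their hashtags."""
    # Imported here so count_hashtags works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",  # start from the oldest message
        consumer_timeout_ms=10000,     # stop iterating after 10 s of silence
    )
    try:
        values = (record.value for record in islice(consumer, limit))
        return count_hashtags(values)
    finally:
        consumer.close()
```

For example, `run().most_common(10)` would list the ten most frequent hashtags among the first messages in the topic.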

Owner

Rustam Zokirov
