Sentiment-based Stock Prediction: WallStreetBets Case Study
Introduction:
This data product applies machine learning to Reddit posts from the /WallStreetBets community to predict fluctuations in the GME (GameStop) stock price. Using both supervised and unsupervised learning methods, it examines how the sentiment of these posts relates to subsequent price changes, with the aim of supporting better-informed trading decisions and investment strategies.
Problem Statement:
The goal of this data product is to develop and compare different machine learning models capable of predicting stock price fluctuations based on sentiment analysis of Reddit posts from the /WallStreetBets community. This would provide valuable insights for investors and traders interested in understanding the relationship between social media sentiment and stock market movements.
Data Collection:
Data was collected from Reddit's /WallStreetBets community, focusing on posts related to GME stock. Each record in the dataset contains a post title, a post body, and the stock price change on the day after the post. Pairing each post with the next day's price change makes it possible to analyze the relationship between online discussion and subsequent stock price movement.
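The pairing of posts with next-day price changes can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function name next_day_labels, the binary up/down labeling, and the closing prices are all assumptions made up for the example.

```python
from datetime import date

def next_day_labels(closes, post_dates):
    """Label each post date 1 if the next trading day's close is higher, else 0.

    closes: dict mapping trading date -> closing price (hypothetical input format).
    Post dates without a following trading day in `closes` are skipped.
    """
    days = sorted(closes)
    # Map every trading day to the trading day that follows it.
    next_day = dict(zip(days, days[1:]))
    labels = {}
    for d in post_dates:
        if d in next_day:
            labels[d] = int(closes[next_day[d]] > closes[d])
    return labels

# Made-up prices for illustration only; these are not real GME quotes.
closes = {
    date(2021, 1, 4): 10.0,
    date(2021, 1, 5): 12.0,
    date(2021, 1, 6): 11.0,
}
labels = next_day_labels(closes, list(closes))
```

In this sketch, posts made on non-trading days are dropped as well; a fuller version would have to decide which trading day such posts belong to.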
Methodology:
The methodology consists of three main parts, each crucial to building accurate and effective models for stock price prediction:
Data Preprocessing and Feature Engineering:
Text cleaning and preprocessing to remove noise and irrelevant information
Text vectorization using Bag of Words (BOW) and TF-IDF techniques to represent text data numerically
Dimensionality reduction using PCA to shrink the feature space while retaining most of the variance in the data
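The cleaning and vectorization steps above can be sketched with the Python standard library alone. The cleaning rules, the function names, and the sample posts are assumptions for illustration. The TF-IDF weight used here is the plain form, term count times log(N / document frequency); libraries typically apply smoothed variants. The PCA step is omitted from this sketch.

```python
import math
import re
from collections import Counter

def clean(text):
    """Lowercase the text, drop URLs, and keep only alphabetic tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z]+", text)

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def tfidf(rows):
    """Reweight a count matrix by log(N / document frequency) per term."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Every vocabulary term occurs in at least one document, so df >= 1.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in rows
    ]

# Invented example posts, not taken from the dataset.
posts = [
    "GME to the moon!!! https://example.com",
    "Selling GME today",
    "Moon or bust",
]
tokens = [clean(p) for p in posts]
vocab, bow = bag_of_words(tokens)
weights = tfidf(bow)
```

Terms that appear in fewer posts receive larger weights, which is why "the" (one post) outweighs "gme" (two posts) in the first row.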
Unsupervised Learning:
K-Means Clustering to identify patterns and trends in the data
Naive Bayes to model the likelihood of stock price changes based on post sentiment (Naive Bayes is ordinarily a supervised classifier, since it is trained on labeled examples)
Evaluation of model performance using qualitative analysis, considering the difficulty of quantifying unsupervised learning results
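As a rough illustration of the clustering step, the following is a minimal standard-library sketch of K-Means (Lloyd's algorithm). The function names, the seeded random initialization, and the two-dimensional points are assumptions; the points stand in for reduced post vectors and are not data from the study.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_points(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, iters=20, seed=0):
    """Alternate assignment and centroid updates for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = assign_points(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centroids.
    return assign_points(points, centroids), centroids

# Made-up 2-D points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assign, centroids = kmeans(points, k=2)
```

Because the cluster indices themselves are arbitrary, the result is usually read by inspecting which posts land together, which fits the qualitative evaluation described above.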
Supervised Learning:
Logistic Regression to model the relationship between post sentiment and stock price changes
Support Vector Machine to classify stock price changes based on post sentiment
Random Forest to build an ensemble of decision trees for accurate classification
Evaluation of model performance using accuracy, confusion matrix, and classification report to determine the most effective methods
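The evaluation metrics named above can be computed by hand for a binary up/down label, as in the following sketch. The function names and the label vectors are invented for illustration and do not reproduce the study's results. The matrix layout (rows are true labels, columns are predicted labels) follows the convention used by scikit-learn.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def report(y_true, y_pred):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 1 = price went up the next day, 0 = it did not.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred)
scores = report(y_true, y_pred)
```

Reporting precision and recall alongside accuracy matters when up and down days are unevenly represented, since accuracy alone can look high for a model that always predicts the majority class.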
Results:
In the unsupervised experiments, the K-Means Clustering model performed best under qualitative evaluation: the sentiment of the Reddit posts often corresponded to stock price changes several days ahead of time. Naive Bayes performed poorly in comparison. In the supervised experiments, Random Forest with BOW features and title aggregation stood out, achieving approximately 83% accuracy.
Conclusion:
Both supervised and unsupervised learning methods showed promise in predicting stock price fluctuations from sentiment analysis of Reddit posts in the /WallStreetBets community. Further refinement of the preprocessing pipeline, the word embedding methods, and the resolution of the labels used for training and classification may improve the models' predictive ability and make them more useful for investment decision-making.
Applications:
This data product can be used by traders, investors, and financial institutions to gain insights into the impact of social media sentiment on stock prices. By understanding these relationships, users can make better-informed trading decisions, develop more effective investment strategies, and manage risk more efficiently.
Challenges & Future Work:
Improving text preprocessing and feature engineering techniques for better model performance
Exploring alternative text vectorization methods to capture more nuanced relationships in the data
Incorporating other external factors and data sources, such as macroeconomic indicators and news articles, for more comprehensive analysis
Investigating the applicability of these models to other stocks and financial markets to determine their broader usefulness and adaptability