Eating Preference Analysis Based on Human Personality Traits
The Engineering & Research Challenge: To quantify the relationship between latent personality traits and physical world behavior (dining choices). The technical barrier was acquiring and cleaning massive datasets from rate-limited social media platforms to build a statistically significant sample size for psychographic modeling.
Data Engineering & Acquisition Infrastructure:
- Distributed Scraping Pipeline:
- Architected a robust scraper using Selenium and BeautifulSoup to harvest 55,000 Foursquare check-ins and ~2 million tweets.
- Rate-Limit Mitigation: Implemented a rotating proxy network and request throttling to bypass strict API rate limits, ensuring continuous data ingestion without IP bans.
- Cost-Optimized API Integration:
- Integrated IBM Watson Personality Insights API to extract Big-Five personality traits for 7,800 unique users.
- Optimization: Engineered a caching layer and smart back-off strategy that reduced redundant API calls, cutting project overhead costs by ~45%.
NLP & Statistical Inference:
- Noise Reduction Pipeline:
- Developed automated cleaning scripts using spaCy and NLTK to strip noise (emojis, URLs) from the raw text corpus.
- Result: Reduced dataset noise by ~45%, significantly improving the signal-to-noise ratio for downstream analysis.
- Correlation Analysis:
- Conducted rigorous statistical testing (Pearson/Spearman) to validate the behavioral taxonomy.
- Key Finding: Established a statistically significant correlation (ρ = .41, p < 0.01) between high Extraversion scores and a preference for high-cost dining establishments.
Technologies Leveraged: Python, IBM Watson API, Selenium, BeautifulSoup, scikit-learn (Pearson/Spearman), spaCy, NLTK, Pandas, Docker, Rotating Proxies.
