
Problem Statement

As a data scientist for the marketing division at Reddit, I must discover the most predictive keywords and/or phrases to accurately classify posts from the dating advice and relationship advice subreddit pages, so we can use them to determine which ads should populate each page. Since this is a classification problem, we'll use Logistic Regression and Naive Bayes models. Misclassifications in this case are fairly low-risk, so I will use the accuracy score and set a baseline of 63.3% to rate success. Using TfidfVectorization, I'll get the feature importances to determine which words have the greatest predictive power for the target variables. If successful, this model can be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
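The notebooks themselves are not reproduced here, but a minimal sketch of this kind of scrape-and-save loop, assuming Reddit's public .json listing endpoint, might look like the following. The subreddit names, User-Agent string, page count, and file paths are assumptions, not the exact code from the repo.

```python
# Hypothetical sketch: pull posts from a subreddit's JSON listing with
# requests, collect them into a DataFrame, and save to CSV.
import time

import pandas as pd
import requests

def scrape_subreddit(subreddit, pages=40):
    """Collect post dicts from a subreddit's .json listing, page by page."""
    posts, after = [], None
    headers = {"User-Agent": "subreddit-classifier-scraper"}  # assumed value
    for _ in range(pages):
        params = {"limit": 100, "after": after}
        res = requests.get(f"https://www.reddit.com/r/{subreddit}.json",
                           headers=headers, params=params)
        if res.status_code != 200:
            break
        data = res.json()["data"]
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]
        if after is None:
            break
        time.sleep(1)  # be polite to the API between requests
    return posts

for name in ["relationship_advice", "dating_advice"]:
    df = pd.DataFrame(scrape_subreddit(name))
    df.to_csv(f"./dataset/{name}.csv", index=False)  # assumed path
```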

Data Cleaning and EDA

  • Dropped rows with a null selftext column because those rows are of no use to me.
  • Combined the title and selftext columns into one new column, all_text.
  • Examined the distributions of word counts for the title and selftext columns per post and compared the two subreddit pages. (A rough sketch of these steps follows this list.)
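A minimal sketch of these cleaning and EDA steps, assuming the CSVs saved above and Reddit's standard field names (title, selftext, subreddit):

```python
# Hypothetical sketch of the cleaning steps listed above.
import pandas as pd

df = pd.concat([
    pd.read_csv("./dataset/relationship_advice.csv"),
    pd.read_csv("./dataset/dating_advice.csv"),
], ignore_index=True)

# Drop rows with a null selftext, since they carry no body text to model on.
df = df.dropna(subset=["selftext"])

# Combine title and selftext into a single all_text column.
df["all_text"] = df["title"].fillna("") + " " + df["selftext"]

# Compare word-count distributions per post across the two subreddits.
df["title_word_count"] = df["title"].str.split().str.len()
df["selftext_word_count"] = df["selftext"].str.split().str.len()
print(df.groupby("subreddit")[["title_word_count", "selftext_word_count"]].describe())
```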

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most frequently, I will be correct 63.3% of the time.
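The baseline is just the share of the majority class, which can be computed directly from the target column (continuing from the cleaning sketch above):

```python
# Baseline accuracy: always predicting the more common subreddit.
# Any useful model has to beat this number.
baseline = df["subreddit"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.3f}")  # ~0.633 on this data
```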

First attempt: logistic regression model with default CountVectorizer parameters (sketched after the list below). Train score: 99 | test: 75 | cross val: 74. Second attempt: tried CountVectorizer with stemmer preprocessing on the first batch of scrapes; pretty bad score with high variance. Train 99%, test 72%.

  • Tried decreasing max features and the score got a lot worse.
  • Tried lemmatizer preprocessing instead and the test score went up to 74%.
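A rough sketch of the first attempt, assuming the all_text column built earlier as the feature and the subreddit label as the target (not the exact notebook code):

```python
# Hypothetical sketch: default CountVectorizer feeding a logistic regression,
# scored on a held-out test set and with cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

X = df["all_text"]
y = df["subreddit"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("cvec", CountVectorizer()),                  # default parameters in the first attempt
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

print("train:", pipe.score(X_train, y_train))                          # ~0.99 reported above
print("test: ", pipe.score(X_test, y_test))                            # ~0.75 reported above
print("cv:   ", cross_val_score(pipe, X_train, y_train, cv=5).mean())  # ~0.74 reported above
```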

Just increasing the amount of data and stratifying y in my train/test split increased my CountVectorizer test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a great deal: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, the model was still overfit.
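Continuing from the previous sketch, those two changes amount to stratifying the split and passing the extra parameters to the vectorizer; the exact notebook code may differ:

```python
# Hypothetical sketch: stratified split plus min_df=3 and ngram_range=(1, 2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

pipe = Pipeline([
    ("cvec", CountVectorizer(min_df=3, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("test:", pipe.score(X_test, y_test))                            # ~83 reported above
print("cv:  ", cross_val_score(pipe, X_train, y_train, cv=5).mean())  # ~82 reported above
```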

I believe Tfidf worked the best to decrease my overfitting-due-to-variance problem because I customized the stop words to take out the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to improve all of the scores. Looking at both single words and words in groups of two (bigrams) was the best parameter that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of words had plenty of gibberish words and typos; setting the minimum number of times a word had to appear to 2 helped get rid of these. Gridsearch also suggested a 90% max_df rate, which helped eliminate oversaturated words as well. Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus on only the most frequently used words of what was left.
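A sketch of that Tfidf setup, continuing from the earlier sketches. The extra stop words shown here are placeholders, not the actual list used in the project, and the grid is only an example of searching the parameters mentioned above:

```python
# Hypothetical sketch: TfidfVectorizer with customized stop words, plus a grid
# search over min_df, max_df, ngram_range, and max_features.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Placeholder additions: overly frequent, non-predictive words would go here.
custom_stop_words = list(ENGLISH_STOP_WORDS) + ["relationship", "advice"]

tfidf_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=custom_stop_words)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

params = {
    "tfidf__min_df": [2, 3],
    "tfidf__max_df": [0.9, 1.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [5000, None],
}
gs = GridSearchCV(tfidf_pipe, params, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)                    # (1, 2), max_df=0.9, max_features=5000 reported above
print("test:", gs.score(X_test, y_test))
```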

Summary and Recommendations

Though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. The same keywords could be used to find other potentially lucrative pages if ad engagement increases. I found it interesting that taking out the overly used words helped with the overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different or better result.

About

Used Reddit’s API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.

Author: