Twitterverse’s opinion of NASA Research

tldr: NASA conducts research on topics ranging from human spaceflight to climate change, and each topic is allocated a different amount of money in the annual federal budget. Is the ranking of topics by budget allocation similar to their popularity ranking on Twitter, with the most popular topics also being the most funded? Or does the Twitterverse not really care, and equally love all things “spacey”?

The training data was obtained by mining NASA PDF documents that detail which projects fall under each research topic. Paragraph vectors (doc2vec) were used to convert this text into a feature space, on which a logistic regression classifier was trained to assign topics to texts. The NASA tweets were then fed to this classifier to assign a topic to each tweet. Favorites and retweets were grouped by topic and tested for the statistical significance of their differences from one another.

It was found that not all things “spacey” were treated the same by the Twitterverse. Science topics, particularly the science of distant worlds (Astrophysics), captured the imagination of tweet readers the most: they exhibited a higher probability of being retweeted and favorited than most of their peers. However, no similarity was found between funding trends and Twitterverse popularity.

Background

NASA works on several subjects that range from climate change on Earth to the study of distant stars and galaxies. In addition, it is also involved in engineering the technologies that enable these scientific studies. These research activities are clustered into different areas or topics such as Planetary Science, Space Exploration, Aeronautics, etc.

Being a federal agency, NASA receives the funding to carry out its activities from the President’s annual federal budget. This budget not only decides how much money NASA gets every fiscal year, but also how this money is distributed across all of its research areas (topics). Since the progress made by NASA (as well as its contractors, such as academic and research institutions) in each research topic is linked to the funding allocated to that topic, this budget breakdown can be a contentious issue.

NASA budget for the year 2012, divided across its various topics/programs. (Source: NAP)

Broadly speaking, the breakdown of the NASA budget across its research topics is driven by the US political climate, the scientific urgency of solving a certain research problem, and other factors that govern the overall US federal budget. Like all federal spending, the taxpayer’s opinion (public opinion) should be a factor in these decisions.

There are several existing studies of US public opinion in relation to NASA. These studies concentrate on answering questions such as: “What is the public opinion of NASA?”, “Do we spend enough money on NASA?”, “Should we send humans to Mars?”, etc. However, there is a void when it comes to studies that focus on differences in public opinion across NASA research topics. In particular, I could not find any studies that examine whether the public opinion of different topics differs, or whether “all things space” are equally favored by taxpayers. Driven by this lack of information, I was interested in conducting a back-of-the-envelope study that could paint a picture of the public perception of NASA research topics and inspect whether there is a difference in the popularity of different topics.

I turned to NASA social media data to conduct this study, choosing responses to NASA’s Twitter feed as a proxy for public opinion. In the rest of this post, I’ll talk about how I used NASA Twitter data to extract the Twitterverse’s opinion of different NASA topics.

Objective:

The main goal was to investigate whether the responses to NASA tweets differed across research topics. NASA research is divided into the following topics for the purpose of federal fund allocation: (1) Science (2) Exploration (3) Space Operations (4) Space Technology (5) Aeronautics (6) Mission Services (7) Education (8) Construction and Environmental Compliance (9) Inspector General.

I was interested in finding out whether the responses to NASA tweets, expressed in terms of favorites and retweets and grouped into these topics, were different from one another. If they were indeed different, I wanted to know whether the ranking of these topics by popularity bore any resemblance to the ranking of topics by the funding allocated to them.
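One way to quantify that final comparison is a rank correlation. Here is a minimal sketch using SciPy, with hypothetical placeholder ranks standing in for the rankings derived later in this post.

# Hypothetical sketch: comparing a popularity ranking with a funding ranking.
# The rank values below are made up for illustration only.
from scipy.stats import spearmanr

topics = ['Science', 'Exploration', 'Space Operations', 'Space Technology', 'Aeronautics']
popularity_rank = [1, 3, 2, 5, 4]  # hypothetical: rank of each topic by retweets/favorites
funding_rank = [1, 2, 3, 4, 5]     # hypothetical: rank of each topic by budget allocation
rho, p_value = spearmanr(popularity_rank, funding_rank)
print('Rank agreement across', len(topics), 'topics: Spearman rho = %.2f (p = %.3f)' % (rho, p_value))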

Scope and Limitations

Traditional opinion polls often comprise small sample sizes and require financial resources to collect polling data. Social media data, on the other hand, is available free of cost and comes in large volumes. These factors have contributed to the rise of mining Twitter, Facebook and other social media platforms for curating insights and opinions. However, this comes at the cost of having to deal with uncontrolled variables. For instance, demographics and population groups are not always available for data mined from these platforms. Hence such studies run the risk of inducing selection bias in the curated opinion if the right caveats and boundaries are not considered while drawing conclusions.

In this study, the demographics of both NASA followers and the accounts that respond to a NASA tweet were unavailable through the Twitter API. Hence I could not filter responses by geographical location to keep US-based accounts alone. One could argue that the US has the largest number of Twitter users and that NASA, being a US federal agency, should elicit more responses from US users, so the Twitterverse’s opinion could be approximated as US public opinion. However, not knowing other demographic information such as age, race, gender, etc. of NASA’s Twitter followers meant I could still run the risk of my analysis being skewed towards a certain race, age group, etc. For instance, the US Twitterverse in 2016 was found to comprise more college-educated, urban young adults than other demographic groups. Hence, the scope of the study was limited to being interpreted as the Twitterverse’s opinion of different NASA topics, and not as US public opinion or global public opinion.

Another point to keep in mind is that this study makes the default assumption that a Twitter user’s online behavior replicates their real-life concerns and interests. A retweet or a favorite from a user is assumed to indicate that the user is interested in the topic of the tweet.

Sidebar:

  • Currently, there are no published journal articles, or articles released by the company Twitter, that break down Twitter users by country. However, the articles by market research firms that release social media statistics all point to the US as the country with the highest number of Twitter accounts (total users, not per-capita users). Here is a link to one such article on Forbes that looked at 2014 Twitter user data. Here are some more [1, 2].
  • The Twitter REST API does not provide demographics of either the users who respond to a certain tweet or the followers of an account. This information can only be obtained by the admins of an account through Twitter Analytics.
  • A study of the demographics of the US Twitterverse was conducted by Pew Research Center. Head over here to see what else this study found about the Twitterverse and the user demographics of other social media platforms.
  • It is also worth mentioning that I checked the data available through the Facebook API for conducting this study. Although the Facebook API did not have Twitter’s 3200-post limit, it did not provide any demographics of the people who respond to posts on a page. Hence the problem of uncontrolled variables would have persisted, had I chosen posts from the NASA Facebook page for this study.
  • I haven’t explored any paid data warehousing services. The demographic information could perhaps be obtained from them, and this study could then be extended to extract opinion by nationality and control for other demographic factors as necessary.
  • Since the data did not have any identifiable information about users, this work was not sent for an IRB review.

Methodology:

The first step was extracting the useful information from the NASA tweets obtained using the Twitter API. Once I had filtered the extraneous content out of the tweets, I had to derive a methodology for assigning a topic to each tweet (the topics being the 9 mentioned earlier in the Objective section). To assign the topics, I tried a number of supervised and unsupervised algorithms and different ways of sourcing labeled topic data. Once I had the topic assignments in place, I conducted tests for the statistical significance of the differences in the Twitterverse’s response to tweets of different topics. The Twitterverse’s response is measured using the number of favorites and retweets a post gets.

Tools: Twitter API, Python (pandas, gensim, NLTK, scikit-learn, SciPy, twitter, pdfminer).

Here is an in-depth look into each of the steps and the related code and graphs.

Data Extraction

I used the Twitter REST API to obtain the text and auxiliary information of the tweets by NASA. To use the API, a Twitter developer account is needed. The API only returns the 3200 most recent tweets; for longer timelines, paid services that warehouse Twitter data have to be used. In the case of NASA’s Twitter activity, 3200 tweets corresponded to about six months of history. I had extracted NASA Twitter data for the first half of 2016 for another project last summer; combining that with the current batch of data from the API, I had data for the entire year of 2016. Python has a wonderful library for the Twitter API that lets you connect to Twitter and get a JSON dump of all the tweets you want to extract. Here is how I used it to extract data.

import numpy as np
import pandas as pd
import json 

# Use the Twitter Python API
from twitter import *
# Get token_key, token_secret, consumer_key, consumer_secret from your Twitter developer account.
twitter=Twitter(auth=OAuth(token_key,token_secret,consumer_key,consumer_secret))
Tweets=twitter.statuses.user_timeline(screen_name='nasa',count=1)
print(json.dumps(Tweets[0],indent=4))          # Inspect the structure of a tweet
print(json.dumps(Tweets[0]['user'],indent=4))  # Inspect the nested user object
Retweet_User_Name=[]
Retweet_User_Description=[]
Retweet_User_Followers_Count=[]
Tweet_Text=[]
Tweet_Favorite_Count=[]
Tweet_Followers_Count=[]
Tweet_Retweet_Count=[]
Tweet_DateTime=[]
Tweet_Retweet=[]
ID=twitter.statuses.user_timeline(screen_name='nasa',count=1)[0]['id'] # Get the id of the most recent tweet
for batch in range(int(3200/200)):
    Tweets=twitter.statuses.user_timeline(screen_name='nasa',max_id=ID,count=200)
    for counter in range(len(Tweets)):
        Tweet_Text.append(Tweets[counter]['text'])
        Tweet_Favorite_Count.append(Tweets[counter]['favorite_count'])
        Tweet_Followers_Count.append(Tweets[counter]['user']['followers_count'])
        Tweet_Retweet_Count.append(Tweets[counter]['retweet_count'])
        Tweet_DateTime.append(Tweets[counter]['created_at'])

        if "retweeted_status" in Tweets[counter]: # Present only if the tweet is a retweet of another user.
            Retweet_User_Name.append(Tweets[counter]['retweeted_status']['user']['name'])
            Retweet_User_Description.append(Tweets[counter]['retweeted_status']['user']['description'])
            Retweet_User_Followers_Count.append(Tweets[counter]['retweeted_status']['user']['followers_count'])
            Tweet_Retweet.append(1)

        else:
            Retweet_User_Name.append(np.nan)
            Retweet_User_Description.append(np.nan)
            Retweet_User_Followers_Count.append(np.nan)
            Tweet_Retweet.append(0)
    ID=Tweets[-1]['id']-1 # max_id is inclusive, so subtract 1 to avoid re-fetching the last tweet

# created_at format is: Day Month Date Time Timezone-offset Year (the offset field is skipped below)
Tweet_Day=list(map(lambda x: x.split()[0], Tweet_DateTime))
Tweet_Month=list(map(lambda x: x.split()[1], Tweet_DateTime))
Tweet_Date=list(map(lambda x: int(x.split()[2]), Tweet_DateTime))
Tweet_Time=list(map(lambda x: x.split()[3], Tweet_DateTime))
Tweet_Year=list(map(lambda x: int(x.split()[5]), Tweet_DateTime))
Tweet_Collection = pd.DataFrame({'Text':Tweet_Text, 'DateTime':Tweet_DateTime,'Tweet_Day':Tweet_Day,'Tweet_Month':Tweet_Month,
                                 'Tweet_Date':Tweet_Date,'Tweet_Time':Tweet_Time,'Tweet_Year':Tweet_Year,
                                 'Followers_Count':Tweet_Followers_Count,'Favorite_Count':Tweet_Favorite_Count,'Retweet_Count':
                                 Tweet_Retweet_Count,'Retweet_User_Name':Retweet_User_Name,'Retweet_User_Description'
                                 :Retweet_User_Description,'Retweet_User_Follower_Count':Retweet_User_Followers_Count})
Tweet_Collection = Tweet_Collection[['DateTime','Tweet_Day','Tweet_Month','Tweet_Date','Tweet_Time','Tweet_Year','Text', 
                                     'Followers_Count','Favorite_Count','Retweet_Count','Retweet_User_Name',
                                     'Retweet_User_Description','Retweet_User_Follower_Count']] 
# Re-arrange the column names to be in the order of our preference
writer = pd.ExcelWriter('Twitter_Collection2.xlsx', engine='xlsxwriter')
Tweet_Collection.to_excel(writer, sheet_name='NASA_Tweets_second')
writer.save()

The following information about NASA’s Twitter activity was available through the API: datetime, text of the tweet, followers count, favorite count, retweet count, retweet user name, retweet user description, and retweet user follower count. The first five variables are the text and auxiliary information about the tweet itself. The last three are present when NASA retweets a user and correspond to the user whose tweet NASA retweeted.

Variable: Description
text: text of each tweet
followers count: number of followers of the NASA Twitter account
favorite count: number of favorites for each tweet
retweet count: number of retweets for each tweet

Data Exploration

So, what did the data look like?  

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib  inline
import numpy as np
from IPython.display import display
twitter_data=pd.read_excel('Twitter_DataDump.xlsx', sheet_name='NASA_Tweets_2016')
# Get the first few rows of data
twitter_data.head()
  DateTime Tweet_Day Tweet_Month Tweet_Date Tweet_Time Tweet_Year Text Followers_Count Favorite_Count Retweet_Count Retweet_User_Name Retweet_User_Description Retweet_User_Follower_Count
0 2016-04-29 13:59:02 Fri Apr 29 13:59:02 2016 This week on @Space_Station, @Astro_TimPeake r… 17539151 1371 514 NaN NaN NaN
1 2016-04-29 13:17:34 Fri Apr 29 13:17:34 2016 RT @Astro_Jeff: Good Morning! #sunrise https:… 17539151 0 1155 Jeff Williams NASA Astronaut, US Army Colonel, experimental … 81409.0
2 2016-04-29 01:41:50 Fri Apr 29 01:41:50 2016 RT @astro_tim: #GoodNight #Tehran from @Space_… 17539151 0 1336 Tim Kopra Living and working on board the International … 112881.0
3 2016-04-29 00:42:01 Fri Apr 29 00:42:01 2016 Our next #EarthExpedition is targeted at tackl… 17539151 932 399 NaN NaN NaN
4 2016-04-29 00:12:38 Fri Apr 29 00:12:38 2016 RT @NASAJPL: Bright Idea: #CrazyEngineering to… 17539151 0 539 NASA JPL NASA Jet Propulsion Laboratory manages many of… 1001447.0


Since the objective was to explore the popularity of tweets using favorite and retweet counts, I wanted to see which tweets had the maximum favorite count and which were retweeted the most. Before doing this, I divided the favorite and retweet counts by the NASA follower count, to remove the influence of the variation in the number of NASA Twitter followers through the year (I wanted to make sure an increase or decrease in NASA followers was not responsible for an increase or decrease in the number of people retweeting or favoriting tweets of a certain topic). I was also interested in seeing whom NASA retweeted the most, and in the distribution of retweets and favorites across all the tweets in my dataset. Related code and output:

# I have some data from 2015 too, so filter it out.
twitter_data_2016=twitter_data[twitter_data.Tweet_Year==2016].copy() # .copy() avoids SettingWithCopyWarning
# Normalize the counts by follower count
twitter_data_2016['Retweet_Count_Normalized']=twitter_data_2016['Retweet_Count']/twitter_data_2016['Followers_Count']
twitter_data_2016['Favorite_Count_Normalized']=twitter_data_2016['Favorite_Count']/twitter_data_2016['Followers_Count']
# Get the top 10 most liked (favorited) tweets
sorted_index=twitter_data_2016['Favorite_Count'].sort_values(ascending=False,inplace=False).index
top10_tweets=twitter_data_2016.Text[sorted_index].values[:10]
print('These are the top 10 most favorited tweets: \n')
print(top10_tweets)
print('\n')
sorted_index=twitter_data_2016['Retweet_Count'].sort_values(ascending=False,inplace=False).index
top10_tweets=twitter_data_2016.Text[sorted_index].values[:10]
print('These are the top 10 most retweeted tweets: \n')
print(top10_tweets)
These are the top 10 most favorited tweets: 

[ 'A purple nebula, in honor of Prince, who passed away today. https://t.co/7buFWWExMw https://t.co/ONQDwSQwVa'
 'We are saddened by the loss of Sen. John Glenn, the first American to orbit Earth. A true American hero. Godspeed,… https://t.co/TroDhCAmEe'
 '#1YearOfDragMeDown. See what @OneDirection saw when filming their video at @NASA_Johnson: \nhttps://t.co/4vsTUf1qwO https://t.co/UgTPvSOlMd'
 "Success! Engine burn complete. #Juno is now orbiting #Jupiter, poised to unlock the planet's secrets.  https://t.co/YFsOJ9YYb5"
 "Congratulations to #WorldSeries champions Chicago @Cubs! In your honor, here's a pic of #Chicago from space:… https://t.co/xRzQtLSc5d"
 "Congrats Margaret Hamilton on receiving the #MedalofFreedom today! You helped us make a 'giant leap' on moon landin… https://t.co/Yv0WnZJy4e"
 'He who is not courageous enough to take risks will accomplish nothing in life - #RIPMuhammadAli https://t.co/Ld4CmPyn5Y'
 'And touchdown! Welcome home @StationCDRKelly, officially back on Earth after spending a #YearInSpace. https://t.co/vmfGJfCRpA'
 '#Juno turned back toward the sun, has power and started its tour of #Jupiter in an initial 53.5-day orbit https://t.co/iwRSSOwPwX'
 'Congrats to the @SpaceX team & @ElonMusk! Way to stick the landing & send #Dragon to @Space_Station. https://t.co/TCJCQljJBZ']


These are the top 10 most retweeted tweets: 

[ "RT @NASA_Johnson: Congrats to @onedirection on their #BRITs2016 win for the 'Drag Me Down' video! We are glad to be part of it. 👏👏 https://…"
 'A purple nebula, in honor of Prince, who passed away today. https://t.co/7buFWWExMw https://t.co/ONQDwSQwVa'
 'We are saddened by the loss of Sen. John Glenn, the first American to orbit Earth. A true American hero. Godspeed,… https://t.co/TroDhCAmEe'
 'RT @StationCDRKelly: First ever flower grown in space makes its debut! #SpaceFlower #zinnia #YearInSpace https://t.co/2uGYvwtLKr'
 '#1YearOfDragMeDown. See what @OneDirection saw when filming their video at @NASA_Johnson: \nhttps://t.co/4vsTUf1qwO https://t.co/UgTPvSOlMd'
 "Success! Engine burn complete. #Juno is now orbiting #Jupiter, poised to unlock the planet's secrets.  https://t.co/YFsOJ9YYb5"
 "RT @POTUS: Congrats SpaceX on landing a rocket at sea. It's because of innovators like you & NASA that America continues to lead in space e…"
 "RT @StationCDRKelly: #Thanks for following our #YearInSpace The journey isn't over. Follow me as I rediscover #Earth! See you down below! h…"
 'RT @StationCDRKelly: Massive #snowstorm blanketing #EastCoast clearly visible from @Space_Station! Stay safe! #blizzard2016 #YearInSpace ht…'
 'RT @astro_timpeake: Today’s exhilarating #spacewalk will be etched in my memory forever – quite an incredible feeling! https://t.co/84Dn3gH…']
plt.figure(figsize=(10,10))
twitter_data.Favorite_Count.plot(kind='line')
plt.xlabel('Tweet Counter',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.ylabel('Favorite Counter',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.title('Distribution of favorites counts for NASA tweets of year 2016 ',fontweight='bold',color='black',fontsize=18,
          fontname='Arial')
plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.tight_layout()
plt.show()
plt.figure(figsize=(10,10))
twitter_data.Retweet_Count.plot(kind='line')
plt.xlabel('Tweet Counter',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.ylabel('Retweet Counter',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.title('Distribution of retweet counts for NASA tweets of year 2016 ',fontweight='bold',color='black',fontsize=18,
          fontname='Arial')
plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.tight_layout()
plt.show()


Some of the most favorited and retweeted tweets had links to accounts such as POTUS, the Chicago Cubs, Muhammad Ali, etc. It figures that linking popular users to a tweet gathers a lot of attention. Note that although these posts do not directly talk about NASA engineering or science, they still contain links to them.

The tweets NASA retweeted the most came from the International Space Station (ISS) and the astronauts onboard the ISS. The year 2016 covered part of the #YearInSpace campaign, in which astronaut Scott Kelly spent a year in space while his twin brother Mark Kelly remained on Earth, for the Twins Study: a project where NASA studied the effects of spaceflight on human genetics by comparing the twins.

Data Processing : Extracting information from the tweets

Before I could extract the tweet topics, I had to get rid of noisy content such as stopwords (if, the, and, etc.), mentions, hyperlinks, etc., which didn’t carry any information about the research topic a tweet belonged to. I also got rid of tweets that were replies to individual users, and of tweets without images, the latter to prevent any confounding effects arising from the hypothesis that tweets with pictures elicit more favorites due to visual appeal. (The overwhelming majority of tweets were accompanied by images, so I did not lose a significant amount of data: less than 1% of the tweets were without images.) Here is the code.

# Clean up text to remove hashtags, retweet tags and hyperlinks
def cleanup_tweets(tweets):
    tweets=tweets.str.lower() 
    retweet_index=tweets.str.startswith('rt')
    # Drop the leading "rt @user:" portion of retweets
    tweets[tweets.loc[retweet_index].index]=tweets[tweets.loc[retweet_index].index].str.split(' ').str[2:].str.join(' ')
    tweets=[" ".join(filter(lambda x:x[0]!='@', tweet.split())) for tweet in tweets]
    tweets=[re.sub(r"http\S+", "", tweet).rstrip(' ') for tweet in tweets]
    return(tweets)
# Get rid of tweets that are replies to individual users and tweets without images
# (tweets without links to media contain no 'http')
reply_index=twitter_data_2016.Text.str.startswith('@')
twitter_data_2016=twitter_data_2016[~reply_index]
twitter_data_2016.reset_index(inplace=True)
no_image_indices=[index_count for index_count in range(len(twitter_data_2016.Text)) if
                  'http' not in twitter_data_2016['Text'].iloc[index_count]]
twitter_data_2016.drop(no_image_indices,inplace=True)
twitter_data_2016.reset_index(inplace=True)
# clean up the text to remove stopwords and some obvious noise words
import nltk 
import re
def process_tweet(text):

    # Load NLTK's English stopwords
    stopwords = nltk.corpus.stopwords.words('english')

    # Run it through a stemmer to get rid of morphological affixes
    stemmer = nltk.stem.snowball.SnowballStemmer("english")

    Text=re.sub('\'|\n|\:|\;|\?|\&|\#|\,|\!|\-|\~|\$|\*|\@|\(|\)|[0-9]+|[^\x00-\x7F]+|\x0c|\%|http\S+|rt|\.|amp', '',text)
    Text=Text.lower()
    Tokens=nltk.word_tokenize(Text)

    # POS tags to drop: conjunctions, numbers, determiners, pronouns, etc.
    delete_list=["CC","CD","DT","EX","FW",'IN',"PDT","POS","PRP","PRP$","WDT","WP","WP$","WRB"]
    stopwords.extend([u'fy',u'nasa',u'esa',u'mission',u'fund',u'budget',u'estimate',u'es',u'et',u'',u'pm',u'am'])
    # These are common words in the budget document which are just noise for our purposes.
    filtered_text=[x[0] for x in nltk.pos_tag(Tokens) if x[1] not in delete_list if x[0] not in stopwords]
    stemmed_text = [stemmer.stem(t) for t in filtered_text]

    return stemmed_text
twitter_data_2016.Text=cleanup_tweets(twitter_data_2016.Text)
Processed_Tweets=[process_tweet(x) for x in twitter_data_2016.Text]
# Store the processed and cleaned twitter information for later use.
import pickle
with open('Processed_Tweets', 'wb') as f: 
    pickle.dump(Processed_Tweets, f, protocol=2)
    
with open('twitter_data_2016', 'wb') as f: 
    pickle.dump(twitter_data_2016, f)

Model building:  Topic Assignment

After extracting the tweets, the next step was to figure out which topic each extracted tweet belonged to. If I had already had a labeled dataset of NASA research topics, this would have been a straightforward supervised learning problem. However, no such dataset was available for me to use.

To begin with, I decided to try a few unsupervised topic modeling approaches, such as Latent Dirichlet Allocation (LDA). As unlikely as it might seem that this would work, I wanted to rule out the hypothesis that natural groupings in the data were reflective of the topics in question. This approach did not work out well, even after filtering the sources of noise out of the tweets. You can head over here to see some of my experiments using unsupervised approaches. The next option was either to manually label a few hundred tweets and use them to train a model to classify the remaining thousands, or to come up with a different methodology. I did not want to spend my time annotating hundreds of tweets, and hence decided to look for alternative approaches for creating a labeled dataset to train my supervised learning algorithm.
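For reference, a minimal sketch of what the LDA attempt looked like, assuming Processed_Tweets is the list of token lists produced in the processing step above; the parameters here are illustrative, not the ones I actually swept over.

# Minimal gensim LDA sketch on the processed tweets (illustrative parameters)
from gensim import corpora, models

dictionary = corpora.Dictionary(Processed_Tweets)  # token -> id mapping
corpus = [dictionary.doc2bow(tokens) for tokens in Processed_Tweets]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=10, num_words=5):
    print(topic_id, words)  # inspect whether the clusters resemble NASA topics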

I decided to look for documents describing these NASA research topics and mine them to create my features or input space (the document section header, which was the topic name, would be the corresponding label or output). My search concentrated on finding documents released by NASA that detail which projects and activities come under each of its research topics. On searching the NASA website, I stumbled upon a budget document that is released every year and details exactly this type of information. Here is the link to these documents, where the PDF titled FY 2017 Budget Estimates is the document of interest. The previous years’ documents can be found under the heading Previous Years’ Budgets. This document covers all the projects currently in progress, with information on topics and projects that goes beyond cursory definitions and short summaries. Since the tweets correspond to current topics of research, this seemed like a good source for creating my labeled dataset. Also, since this document is released every fiscal year, using documents from a couple of years back let me add some historic perspective to account for any throwback posts, and augment the information on all current programs.

These were PDF documents containing text, images and tables. I used the pdfminer API to extract the text. The documents contained information on all the topics of interest, with each topic being assigned a section. I treated each line of text as a data point, with the section header as the label and the line of text as the unrefined input.

The sections corresponding to the topics Construction and Environmental Compliance, Mission Services, and Inspector General were very small in all the documents. After a quick read-through, I realized these topics corresponded to NASA housekeeping and administration information, which did not seem like topics that get tweeted about. I decided to ignore these three topics and train my classifiers on the text pertaining to the remaining topics. Attached below is a snapshot of page 1 of the 2017 budget document, detailing the topic-budget allocation.

NASA Budget Estimates

As can be seen from the figure linked above, the federal budget breakdown does not stop at the topic level, but also decides how much each group or activity within a topic ends up getting. The topic Science comprises the following groups: Earth Science, Heliophysics, Planetary Science, James Webb Telescope and Astrophysics. In discussions of the budget breakdown, the sub-disciplines of Science are often discussed separately, unlike other topics. This perhaps has to do with Earth Science being linked to climate science, one of the most hotly debated topics of space studies. Driven by these two factors, the topics or output labels I used in my model were: (1) Earth Science (2) Heliophysics (3) Planetary Science (4) James Webb Telescope (5) Astrophysics (6) Exploration (7) Space Operations (8) Space Technology (9) Aeronautics (10) Education.

By iterating through a few years’ worth of budget documents, I had a few thousand lines of text data for building my multiclass classification algorithm. Here is how I extracted the text.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO # the original Python 2 code used cStringIO; with pdfminer.six, io.BytesIO plus the codec, or StringIO without it, may be needed
import nltk
import re
import numpy as np

def flatten(items):
    # Recursively flatten nested lists (replacement for compiler.ast.flatten,
    # which no longer exists in Python 3)
    out=[]
    for item in items:
        if isinstance(item, list):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out
page_range_2017=[range(42,125,1),range(125,189,1),range(189,235,1),range(235,245,1),range(245,299,1),range(299,349,1),
            range(349,387,1),range(387,453,1),range(453,532,1),range(532,566,1),range(566,637,1)]
page_range_2016=[range(48,130,1),range(130,183,1),range(183,226,1),range(226,236,1),range(236,288,1),range(288,329,1),
            range(329,368,1),range(368,428,1),range(428,503,1),range(503,567,1),range(567,595,1)]
page_range_2015=[range(51,122,1),range(123,181,1),range(182,221,1),range(222,231,1),range(232,278,1),range(279,320,1),
            range(321,362,1),range(363,415,1),range(416,486,1),range(487,517,1),range(518,586,1)]
page_ranges=[page_range_2017,page_range_2016,page_range_2015] # one set of section page ranges per budget document
paths=['fy2017_budget_estimates.pdf','fy2016_budget_estimates.pdf','fy2015_budget_estimates.pdf']
Topic_List=['Earth Science','Planetary Science', 'Astrophysics', 'James Webb Telescope', 'Heliophysics','Aeronautics',
            'Space Technology','Exploration','Space Operations','Education','Others']
def pdf_to_text(path,pagenos):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'ascii'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb') # Python 3: the file() builtin was removed
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True



    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    Text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return Text
def process_text(text):

    # Load NLTK's English stopwords
    stopwords = nltk.corpus.stopwords.words('english')

    # Run it through a stemmer to get rid of morphological affixes
    stemmer = nltk.stem.snowball.SnowballStemmer("english")

    Text=re.sub('\'|\n|\:|\;|\?|\&|\#|\,|\!|\-|\~|\$|\*|\(|\)|[0-9]+|[^\x00-\x7F]+|\x0c|\%|http\S+', '',text)
    Text=Text.lower()
    Tokens=nltk.word_tokenize(Text)

    # POS tags to drop: conjunctions, numbers, determiners, pronouns, etc.
    delete_list=["CC","CD","DT","EX","FW",'IN',"PDT","POS","PRP","PRP$","WDT","WP","WP$","WRB"]
    stopwords.extend([u'fy',u'nasa',u'esa',u'mission',u'fund',u'budget',u'estimate',u'es'])
    # These are common words in the budget document which are just noise for our purposes.
    filtered_text=[x[0] for x in nltk.pos_tag(Tokens) if x[1] not in delete_list if x[0] not in stopwords]
    stemmed_text = [stemmer.stem(t) for t in filtered_text]

    return stemmed_text
Pdf_To_Document=[]
for counter1 in range(len(page_range_2016)):
    text=[]
    for counter2 in range(len(page_ranges)):
        pages=page_ranges[counter2][counter1]
        text.append(pdf_to_text(paths[counter2],pages))

    text=[','.join(flatten(text))]   

    # document is a list, where each item is a string representing a text
    # pertaining to a topic for all three years.

    Pdf_To_Document.append(flatten(text))
Combined_Text=Pdf_To_Document
Input_Space=[]
for counter in range(len(Combined_Text)):
    Input_Topic_List=[]
    Input_Topic_List.append(process_text(Combined_Text[counter][0]))
    Input_Topic_List=[','.join(flatten(Input_Topic_List))]  
    Input_Space.append(flatten(Input_Topic_List))
import pickle
with open('Input_Space', 'wb') as f:  # pickle requires binary mode
    pickle.dump(Input_Space, f)

Once I had extracted the useful text and topics, I had to convert them into a numeric vector space in order to apply machine learning algorithms for the subsequent analysis. I used the Paragraph Vector technique, implemented as the doc2vec algorithm in gensim, to do this. Doc2vec extends the famous word2vec algorithm (word2vec baffled the world with its ability to produce results from textual data like “king – man + woman = queen”) to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents. The paragraph vector algorithm is a shallow neural network that maps textual data such that semantically similar text sequences (sentences, paragraphs) have similar vector representations. Once trained, the model can be used to infer numerical vectors for unseen text on related subject matter, and also to predict new sequences of text given a word, or a word given a text sequence.
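As a quick illustration of the word-analogy property mentioned above, here is a sketch assuming gensim’s downloader module and one particular set of pretrained vectors; any pretrained word vectors would do.

# Sketch: the 'king - man + woman = queen' analogy with pretrained vectors
import gensim.downloader as api

word_vectors = api.load('glove-wiki-gigaword-100')  # downloads pretrained vectors
# 'queen' is expected at or near the top of the result
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))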

Before applying doc2vec, I first filtered the data from the PDF documents to remove noisy content such as tags for images, tables, etc. This cleaned-up data was fed into the doc2vec algorithm, and I obtained numerical vectors representing the text sequences in my documents. (You can choose the size of the vector space, the context window size, the minimum word count, etc.) Here is how I did this.

import numpy as np
import pickle
import re
import matplotlib.pyplot as plt
%matplotlib inline 
import gensim
import os
import collections
import smart_open
import random
# Load the processed text from the NASA budget documents
with open('Input_Space','rb') as f:  
   Input_Space = pickle.load(f)
# Load the processed tweets
with open('Processed_Tweets','rb') as f:  
   Processed_Tweets = pickle.load(f)
Topic_List=['Earth Science','Planetary Science', 'Astrophysics', 'James Webb Telescope', 'Heliophysics','Aeronautics',
            'Space Technology','Human Exploration','Space Operations','Education']
# Process the pdf data
Input_Space_Sentences=[]
for counter in range(len(Input_Space)):
    list_of_sentences=[list(filter(None, re.sub("[^a-zA-Z]"," " ,x).split(' '))) for x in str(Input_Space[counter]).split('.')]
    Input_Space_Sentences.append(list_of_sentences)
    
train_corpus=[]
# LabeledSentence is the pre-1.0 gensim name; newer versions call it TaggedDocument
for counter in range(len(Topic_List)):
    for counter2 in range(len(Input_Space_Sentences[counter])):
        # Tag every sentence of a topic with that topic's index
        sentences = gensim.models.doc2vec.LabeledSentence(words= Input_Space_Sentences[counter][counter2], tags=[counter])
        train_corpus.append(sentences)


tweet_corpus=[]
for counter2 in range(len(Processed_Tweets)):
    # Tweets get a dummy tag; only their inferred vectors are used later
    sentences = gensim.models.doc2vec.LabeledSentence(words=Processed_Tweets[counter2],tags=[0])
    tweet_corpus.append(sentences)
import itertools
# Hyperparameter grid: (vector size, window, min_count)
combs=list(itertools.product([100,200,300],[5,7,10],[2,4,6]))
print("Number of hyperparameter combinations to test = ", len(combs))

model_num=1
for item in combs:
    # gensim pre-1.0 API: newer versions use vector_size/epochs instead of size/iter
    model = gensim.models.doc2vec.Doc2Vec(size=item[0], min_count=item[2], iter=10,workers=4,alpha=0.025, 
                                          min_alpha=0.025,window=item[1])
    model.build_vocab(train_corpus)
    for epoch in range(10):
        random.shuffle(train_corpus) # present the documents in a fresh order every epoch
        model.train(train_corpus)
        model.alpha -= 0.002  # manually decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay within the epoch

    # Sanity check: count how many training paragraphs are most similar to their own topic tag
    counter=0
    for doc_id in range(len(train_corpus)):
        inferred_vector = model.infer_vector(train_corpus[doc_id].words)
        sims=model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))[0][0]
        if sims==train_corpus[doc_id].tags[0]:
            counter=counter+1
    print('Self-similarity check:', counter, 'of', len(train_corpus), 'paragraphs match their own topic tag')

    # Represent the training text in terms of the doc2vec embeddings
    input_matrix= np.zeros((len(train_corpus), item[0]))
    output_label= np.zeros((len(train_corpus)))
    for counter in range(len(train_corpus)):
        input_matrix[counter]=model.infer_vector(train_corpus[counter].words)
        output_label[counter]=train_corpus[counter].tags[0]
    
    # Divide the data into test and train (stratified, so each fold keeps the topic proportions)
    from sklearn.model_selection import StratifiedKFold
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
    # We set the random state to ensure every iteration has the same indices being tested and trained
    for train_index, test_index in skf.split(input_matrix, output_label):

        X_train, X_test = input_matrix[train_index],  input_matrix[test_index]
        y_train, y_test = output_label[train_index],  output_label[test_index]
    # Note: only the last fold's split is carried forward below

    # Topic-assignment scoring using logistic regression with inbuilt cross-validation for setting the regularization parameter
    
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.metrics import log_loss
    
    classifier=LogisticRegressionCV(cv=5)
    
    classifier.fit(X_train, y_train)
    clf_probs = classifier.predict_proba(X_test)
    score = log_loss(y_test, clf_probs)
    print('Log loss score on the test set for',item, 'is',score, 'Accuracy score is ',classifier.score(X_test,y_test))
    print('model num', model_num)
    model_num=model_num+1

Once I had the numeric representation of my feature space, I encoded my text labels into numbers. With the labeled dataset in numeric form, I was ready to apply supervised learning to train a classifier that could sort NASA-related text into one of the research topics mentioned above. Using the best doc2vec configuration from the search above, I trained the final classifier (a linear-kernel SVM in the code below) to assign each tweet to one of the 10 topics. Here is the related code.

from sklearn import svm, metrics
svc = svm.SVC(kernel='linear',C=50)
svc.fit(X_train, y_train)
expected = y_test
predicted = svc.predict(X_test)
# Represent each tweet in terms of the doc2vec embeddings (the vector size
# must match the chosen model's; 200 here)
tweet_matrix= np.zeros((len(tweet_corpus), 200))
for counter in range(len(tweet_corpus)):
    tweet_matrix[counter]=model.infer_vector(tweet_corpus[counter].words)
twitter_topic=svc.predict(tweet_matrix)
twitter_topic=[int(x) for x in twitter_topic]
Twitter_Topics=[Topic_List[x] for x in twitter_topic]
import pandas as pd
Twitter_Topics=pd.Series(Twitter_Topics)
import pickle
with open('Topics_Tweets','wb+') as f: 
    pickle.dump(Twitter_Topics, f)

After topic assignment, I extracted the favorites and retweet counts for all the tweets under each of the ten topics. 

# Script to group favorite and retweet counts by topic
import numpy as np
import pickle
import re
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import pylab as pyplot
import pandas as pd
import os
import collections
from IPython.display import display
# Load the processed tweets (assumed to already carry the assigned 'Topic'
# column built from the Twitter_Topics series above)
with open('twitter_data_2016', 'rb') as f: 
    twitter_data_2016 = pickle.load(f)

# Create lists of favorites and retweets for each topic to enable comparison and further processing
group_names=['Space Operations','Aeronautics','Heliophysics','Earth Science','James Webb Telescope',
             'Space Technology','Education','Astrophysics','Planetary Science','Human Exploration']
grouped=twitter_data_2016.groupby('Topic')
fav_array=[list(grouped.get_group(name)['Favorite_Count'].dropna().values) for name in group_names]
retweet_array=[list(grouped.get_group(name)['Retweet_Count'].dropna().values) for name in group_names]
topics=['Space Operations','Aeronautics','Heliophysics','EarthScience', 'JamesWebbTelescope','Space Technology',
                 'Education','Astrophysics','PlanetaryScience','Exploration']

Here is what the NASA tweets grouped by research topic, and the distributions of retweets and favorites by topic, looked like for the year 2016.

# figure 1
# Plot the number of posts belonging to each topic to inspect group sizes
plt.figure(figsize=(12.5,7))
column='Topic'
ylabel_text='Number of Tweets by Topic'
w=len(twitter_data_2016[column].value_counts())*0.1
ax=(twitter_data_2016[column].value_counts(normalize=False)).plot(kind='barh',width=w,legend=False, edgecolor='red',
                                                        fontsize=15,grid=True,color='blue')
plt.title(column.title(),fontweight='bold',color='black',fontsize=18,fontname='Arial')
ax.tick_params(axis='both', labelsize=10,color='blue')
plt.xlabel(ylabel_text,fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
ax.set_axis_bgcolor('gray')
plt.tight_layout()
# figure 2
# Group favorites and retweets by topic and plot to see what the data looks like under each group (topic)
object2=['Favorite_Count', 'Retweet_Count']
object1=['Topic']
for object11 in object1:
    plt.figure(figsize=(7.5,5))
    counter=1
    for object22 in object2:
        plt.figure(figsize=(15,7))
        sns.set(style="whitegrid", color_codes=True)
        ax=sns.boxplot(y=object11, x=object22, data= twitter_data_2016,palette="Set1")
        plt.title(str(object11)+' vs '+str(object22),fontweight='bold',color='black',fontsize=18,fontname='Arial')
        ax.tick_params(axis='both', labelsize=10,color='blue')
        plt.xlabel(object22,fontsize=13, color='black',fontweight='bold',fontname='Verdana')
        plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
        plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
        plt.ylabel(object11,fontsize=13, color='black',fontweight='bold',fontname='Verdana')
        ax.set_axis_bgcolor('white')
        plt.tight_layout()
        ax.set_xlim(0, 10000)
        counter=counter+1

Testing if response to tweets differs across topics

Once I had the retweet and favorite counts grouped by topic, I wanted to check the statistical and practical significance of the differences between them. But before I could apply a hypothesis test, I had to look at the distribution, dispersion and shape of the favorite and retweet counts under each topic in order to select the right test. Here is the code which does just that.

# figure 3
# Shape of distribution of each group (topic)
plt.figure(figsize=(20,16))
for counter in range(len(topics)):

    plt.subplot(1,2,1)
    ax=sns.distplot(fav_array[counter], hist=False,label=topics[counter])
    pyplot.legend(loc=1, fontsize = 'x-large')
    ax.tick_params(axis='both', labelsize=24,color='blue')
    plt.title( 'Favorites for different topics',fontname='Verdana', color='blue',fontsize=24,fontweight='bold')
    plt.xlabel('Fav_Counts',fontsize=24, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    ax.set_axis_bgcolor('white')
    ax.set_xlim(-2000,8000)
    plt.tight_layout()

    
    plt.subplot(1,2,2)
    ax=sns.distplot(retweet_array[counter], hist=False,label=topics[counter])
    pyplot.legend(loc=1, fontsize = 'x-large')
    ax.tick_params(axis='both', labelsize=24,color='blue')
    plt.title( 'Retweets for different topics',fontname='Verdana', color='blue',fontsize=24,fontweight='bold')
    plt.xlabel('Retweet_Counts',fontsize=24, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    ax.set_axis_bgcolor('white')
#     ax.set_xlim(-1000,4000)
    plt.tight_layout()
    

    
for counter in range(len(topics)):
    plt.figure(figsize=(12,6))
    plt.subplot(1,2,1)
    ax=sns.distplot(fav_array[counter], hist=True,kde_kws={"color": "g", "lw": 5}, hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"})
    ax.tick_params(axis='both', labelsize=8,color='blue')
    plt.title( topics[counter]+' Favorites',fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
    plt.xlabel('Fav_Counts',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=8,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=8,fontweight='bold')
    ax.set_axis_bgcolor('white')
    plt.tight_layout()
    
    plt.subplot(1,2,2)
    ax=sns.distplot(retweet_array[counter], hist=True,kde_kws={"color": "r", "lw": 3}, hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "r"})
    ax.tick_params(axis='both', labelsize=8,color='blue')
    plt.title( topics[counter]+' Retweets',fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
    plt.xlabel('Retweet_Counts',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=8,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=8,fontweight='bold')
    ax.set_axis_bgcolor('white')
    plt.tight_layout()
        

As can be seen from the figures above, the group sizes (tweets under each topic) were unequal, and the data in each group followed a non-normal distribution with outliers present. The outliers pointed towards a possible presence of heteroskedasticity. With multiple ANOVA assumptions violated, especially with unequal group sizes, the options were Welch’s test, which is robust to unequal variances, or transforming the data before applying nonparametric tests. The absence of normality made me go down the route of transformations, where I winsorized the right-tail 10% of the data in all groups. Here is the code for transforming the data and for investigating the characteristics of each group after the transformation.

# Winsorize the data to remove outliers: cap everything above the 90th
# percentile of each group at the 90th percentile
df_retweet=pd.DataFrame(retweet_array).transpose()
df_retweet.where(~(df_retweet > df_retweet.quantile(0.90)), df_retweet.quantile(0.90), axis=1,inplace=True)
df_retweet_list=[]
for x in range(10):
    df_retweet_list.append(df_retweet[x].dropna().values)

df_fav=pd.DataFrame(fav_array).transpose()
df_fav.where(~(df_fav > df_fav.quantile(0.90)), df_fav.quantile(0.90), axis=1,inplace=True)
df_fav_list=[]
for x in range(10):
    df_fav_list.append(df_fav[x].dropna().values)


df_retweet.rename(columns=dict( zip(list(range(10)), topics)),inplace=True)
df_fav.rename(columns=dict( zip(list(range(10)), topics)),inplace=True)
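As an aside, SciPy offers a similar transformation out of the box; a sketch, assuming limits=(0, 0.10) to cap only the right tail of each group:

# Right-tail winsorization per group using scipy (illustrative alternative)
from scipy.stats.mstats import winsorize
capped_retweets = [winsorize(np.asarray(group), limits=(0, 0.10)) for group in retweet_array]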

Here is a visualization of the transformed data:

plt.figure(figsize=(15,7))
sns.set(style="whitegrid", color_codes=True)
ax=sns.boxplot(data=df_retweet,orient="h", palette="Set1")  
ax.tick_params(axis='both', labelsize=10,color='blue')
plt.title('Retweet Counts by Topics',fontsize=15, color='blue',fontweight='bold',fontname='Verdana')
plt.xlabel('Transformed Retweet Count',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.ylabel('NASA Research Topics',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
ax.set_axis_bgcolor('white')
plt.tight_layout()
plt.figure(figsize=(15,7))
sns.set(style="whitegrid", color_codes=True)
ax=sns.boxplot(data=df_fav,orient="h", palette="Set1")  
plt.title('Favorite Counts by Topics',fontsize=15, color='blue',fontweight='bold',fontname='Verdana')
ax.tick_params(axis='both', labelsize=10,color='blue')
plt.xlabel('Transformed Favorite Count',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
plt.xticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.yticks(fontname='Verdana', color='blue',fontsize=10,fontweight='bold')
plt.ylabel('NASA Research Topics',fontsize=13, color='black',fontweight='bold',fontname='Verdana')
ax.set_axis_bgcolor('white')
plt.tight_layout()
plt.figure(figsize=(20,16))
for counter in topics:

    plt.subplot(1,2,1)
    ax=sns.distplot(df_retweet[counter], hist=False,label=counter)
    pyplot.legend(loc=1, fontsize = 'x-large')
    ax.tick_params(axis='both', labelsize=24,color='blue')
    plt.title( 'Retweets for different topics',fontname='Verdana', color='blue',fontsize=24,fontweight='bold')
    plt.xlabel('Transformed Retweet_Counts',fontsize=24, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    ax.set_axis_bgcolor('white')
    plt.tight_layout()


    plt.subplot(1,2,2)
    ax=sns.distplot(df_fav[counter], hist=False,label=counter)
    pyplot.legend(loc=1, fontsize = 'x-large')
    ax.tick_params(axis='both', labelsize=24,color='blue')
    plt.title( 'Favorites for different topics',fontname='Verdana', color='blue',fontsize=24,fontweight='bold')
    plt.xlabel('Transformed Fav_Counts',fontsize=24, color='black',fontweight='bold',fontname='Verdana')
    plt.xticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    plt.yticks(fontname='Verdana', color='blue',fontsize=16,fontweight='bold')
    ax.set_axis_bgcolor('white')
    plt.tight_layout()
    

As can be seen from the above figures, although the transformed data was free of outliers, the shape of the distribution was not the same across groups. Hence the nonparametric tests could only compare the stochastic dominance of groups, and not their medians. I applied the Mann–Whitney U test for testing pairwise stochastic dominance between topics. Under the null hypothesis H0 of the Mann–Whitney U test, the probability of an observation from population X exceeding an observation from population Y equals the probability of an observation from Y exceeding an observation from X: P(X > Y) = P(Y > X), or P(X > Y) + 0.5 · P(X = Y) = 0.5. The alternative hypothesis H1 is that the probability of an observation from population X exceeding an observation from population Y differs from the probability of an observation from Y exceeding an observation from X: P(X > Y) ≠ P(Y > X).
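Here is a toy illustration of the quantity being tested, on simulated data rather than the NASA counts: X below is stochastically larger than Y, so the test should reject H0, and U/(n1·n2) should estimate P(X > Y) as well above 0.5.

# Simulated example of the Mann-Whitney U test and the P(X > Y) estimate
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
X = rng.lognormal(mean=1.0, sigma=1.0, size=300)  # stochastically larger group
Y = rng.lognormal(mean=0.5, sigma=1.0, size=200)
U, p = mannwhitneyu(X, Y, alternative='two-sided')
print('Estimated P(X > Y) =', U/(len(X)*len(Y)), ', p-value =', p)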

To account for Type I errors arising from multiple comparisons, I used the Bonferroni correction. To determine the effect size, I used the rank biserial correlation (Cliff’s delta) and the common language effect size statistic (CL statistic) [Grissom and Kim (2012)]. The rank biserial correlation is the proportion of pairs favorable to the hypothesis minus the proportion unfavorable to it. The CL statistic is the probability that a case randomly selected from one group will have a higher score than a case randomly selected from the other group. The rank biserial correlation is the nonparametric analogue of Cohen’s d. In scenarios where effect sizes of experiments are widely available, and a consensus on classifying them into small, medium and large has been established for the field of study, Cliff’s delta would be a good choice for understanding the effect size. Since that is not the case here, the CL statistic is more intuitive for understanding practical significance; for the sake of completeness, Cliff’s delta is reported along with the other metrics given by the test.
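To make those two definitions concrete, here is a small helper that derives both effect sizes from the U statistic the test returns (this mirrors the inline computation in the code that follows).

# Effect sizes derived from the Mann-Whitney U statistic
def effect_sizes(U, n1, n2):
    cl = U / float(n1 * n2)      # CL statistic: estimated P(X > Y), plus half the ties
    r = 1 - 2.0 * U / (n1 * n2)  # rank biserial correlation: favorable minus unfavorable pair proportions
    return r, cl

Here is the code for conducting the hypothesis tests.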

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from itertools import combinations
from IPython.display import display

# Mann-Whitney U tests for pairwise differences between topics
topics = ['SpaceOperations', 'Aeronautics', 'Heliophysics', 'EarthScience',
          'JamesWebbTelescope', 'SpaceTechnology', 'Education', 'Astrophysics',
          'PlanetaryScience', 'Exploration']

T, CL, R, N, U, P = [], [], [], [], [], []
print('Favorite Counts')
for c in combinations(range(10), 2):
    s, p = mannwhitneyu(df_fav_list[c[0]], df_fav_list[c[1]], alternative='two-sided')
    n_pairs = len(df_fav_list[c[0]]) * len(df_fav_list[c[1]])
    r = 1 - (2 * s / n_pairs)   # rank biserial correlation (Cliff's delta)
    cl = s / n_pairs            # common language effect size, P(X > Y)
    if r >= 0:                  # orient each pair so the dominant topic comes first
        T.append(topics[c[1]] + ' > ' + topics[c[0]])
        R.append(r)
        CL.append(1 - cl)
        N.append(str(len(df_fav_list[c[1]])) + ', ' + str(len(df_fav_list[c[0]])))
    else:
        T.append(topics[c[0]] + ' > ' + topics[c[1]])
        R.append(-r)
        CL.append(cl)
        N.append(str(len(df_fav_list[c[0]])) + ', ' + str(len(df_fav_list[c[1]])))
    P.append(p)
    U.append(s)
Favorite_Counts_df = pd.DataFrame({'P-value': P, 'U-statistic': U, "Cliff's delta": R,
                                   'CL statistic': CL, 'Group Sizes': N}, index=T)
print('\nStatistically Significant Pairwise Comparisons\n')
# Bonferroni-corrected threshold: 0.05 divided by the 9 comparisons per topic
Favorite_Counts_Significant = Favorite_Counts_df[Favorite_Counts_df['P-value'] < (0.05 / 9)].sort_index()
display(Favorite_Counts_Significant)

T, CL, R, N, U, P = [], [], [], [], [], []
print('Retweet Counts')
for c in combinations(range(10), 2):
    s, p = mannwhitneyu(df_retweet_list[c[0]], df_retweet_list[c[1]], alternative='two-sided')
    n_pairs = len(df_retweet_list[c[0]]) * len(df_retweet_list[c[1]])
    r = np.round(1 - (2 * s / n_pairs), 2)
    cl = np.round(s / n_pairs, 4)
    if r >= 0:
        T.append(topics[c[1]] + ' > ' + topics[c[0]])
        R.append(r)
        CL.append(str(100 - cl * 100) + '%')   # report CL as a percentage here
        N.append(str(len(df_retweet_list[c[1]])) + ', ' + str(len(df_retweet_list[c[0]])))
    else:
        T.append(topics[c[0]] + ' > ' + topics[c[1]])
        R.append(-r)
        CL.append(str(cl * 100) + '%')
        N.append(str(len(df_retweet_list[c[0]])) + ', ' + str(len(df_retweet_list[c[1]])))
    P.append(p)
    U.append(s)
Retweet_Counts_df = pd.DataFrame({'P-value': P, 'U-statistic': U, "Cliff's delta": R,
                                  'CL statistic': CL, 'Group Sizes': N}, index=T)
print('\nStatistically Significant Pairwise Comparisons\n')
Retweet_Counts_Significant = Retweet_Counts_df[Retweet_Counts_df['P-value'] < (0.05 / 9)].sort_index()
display(Retweet_Counts_Significant)

Here are the pairwise comparisons of favorite counts and retweet counts for which the null hypothesis was rejected.

 
Favorite Counts
Statistically Significant Pairwise Comparisons

Comparison                              CL statistic  Group Sizes  P-value       U-statistic  Cliff's delta
Astrophysics > Aeronautics              0.670941      264, 109     1.774607e-07  9469.0       0.341882
Astrophysics > EarthScience             0.662043      264, 170     9.793555e-09  15167.5      0.324086
Astrophysics > Education                0.659756      264, 74      2.421092e-05  6647.0       0.319513
Astrophysics > Exploration              0.620141      264, 403     1.167666e-07  65978.0      0.240281
Astrophysics > JamesWebbTelescope       0.634708      264, 119     2.283335e-05  11476.0      0.269417
Astrophysics > PlanetaryScience         0.593168      264, 1235    1.679671e-06  193396.5     0.186336
Astrophysics > SpaceOperations          0.660456      264, 2256    2.785200e-19  202227.0     0.320912
Astrophysics > SpaceTechnology          0.663420      264, 28      4.345425e-03  2488.0       0.326840
Heliophysics > Aeronautics              0.625070      98, 109      1.716905e-03  4005.0       0.250140
Heliophysics > EarthScience             0.611014      98, 170      2.244178e-03  10179.5      0.222029
Heliophysics > Education                0.623966      98, 74       4.984912e-03  4525.0       0.247932
PlanetaryScience > Aeronautics          0.586725      1235, 109    2.444262e-03  55633.0      0.173450
PlanetaryScience > EarthScience         0.574184      1235, 170    1.545200e-03  89400.0      0.148369
PlanetaryScience > Exploration          0.531734      1235, 403    2.435414e-05  264646.5     0.063467
PlanetaryScience > SpaceOperations      0.577809      1235, 2256   1.942969e-17  1176292.5    0.155617

Retweet Counts
Statistically Significant Pairwise Comparisons

Comparison                              CL statistic  Group Sizes  P-value       U-statistic  Cliff's delta
Astrophysics > Aeronautics              60.79%        264, 109     1.046224e-03  11284.0      0.22
Astrophysics > EarthScience             59.29%        264, 170     1.073212e-03  18268.5      0.19
Astrophysics > Exploration              57.06%        264, 403     2.034303e-03  60703.5      0.14
Astrophysics > JamesWebbTelescope       64.12%        264, 119     9.726922e-06  11273.5      0.28
Astrophysics > PlanetaryScience         55.47%        264, 1235    5.176251e-03  180864.0     0.11
Astrophysics > SpaceOperations          63.61%        264, 2256    3.879668e-13  216738.0     0.27
EarthScience > SpaceOperations          56.9%         170, 2256    2.555020e-03  165301.5     0.14
Exploration > SpaceOperations           58.35%        403, 2256    8.026439e-08  378675.0     0.17
Heliophysics > JamesWebbTelescope       62.63%        98, 119      1.381155e-03  7303.5       0.25
Heliophysics > SpaceOperations          62.45%        98, 2256     1.425867e-05  83020.0      0.25
PlanetaryScience > JamesWebbTelescope   59.5%         1235, 119    6.045568e-04  59516.5      0.19
PlanetaryScience > SpaceOperations      59.43%        1235, 2256   2.297647e-20  1130228.0    0.19

Observations:

Here are the salient findings of this study:

(i) Not all pairwise topic comparisons were found to have statistically significant differences. We should remember that this study involved unequal group sizes, with a few groups having small samples (30-100 observations), and that a nonparametric test was used. Type II errors arising from a lack of power should therefore not be ruled out as an explanation for our failure to reject the null hypothesis in the non-significant comparisons. The data contained a large number of posts on the topic of space operations owing to the #YearInSpace campaign, which was largely responsible for the skewed dataset. Repeating the study with posts from a different time period, yielding a more balanced design, would be a worthwhile follow-up.
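To make the power concern concrete, here is a rough simulation sketch; the group sizes (30 and 100), the effect (a 0.3 standard-deviation shift) and the normal distributions are illustrative assumptions, not estimates from the data:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
alpha = 0.05 / 9                 # the Bonferroni-corrected threshold used above
n_sims, hits = 2000, 0
for _ in range(n_sims):
    x = rng.normal(loc=0.3, size=30)    # small group with a modest shift
    y = rng.normal(loc=0.0, size=100)   # larger comparison group
    _, p = mannwhitneyu(x, y, alternative='two-sided')
    hits += p < alpha
print('Estimated power:', hits / n_sims)   # ~0.1, well below the usual 0.8 target

At these sample sizes a real but modest difference would usually go undetected, which is why the non-significant pairs should not be read as evidence of no difference.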

(ii) The topics that showed a higher probability of being 'favorited' than some of their peers also exhibited a higher probability of being retweeted than those same peers. Astrophysics showed a 60%-67% probability of being ranked higher in terms of favorites than the other topics, with the sole exception of heliophysics. It likewise dominated most topics (6 out of 9) in terms of retweet counts, with a 55%-64% probability of being ranked higher. The other topics that recurred as selectively dominant (dominant over a few other topics) under both retweets and favorites were planetary science and heliophysics (53%-62%).

(iii) Among the pairwise comparisons deemed statistically significant, most (9 out of 12 for retweets and 11 out of 15 for favorites) were comparisons between science research topics (astrophysics, planetary science and heliophysics) and engineering research topics (aeronautics, space operations, James Webb Telescope, space technology and education). In every one of these comparisons, the science topic was favorited and retweeted more than the engineering topic.

(iv) Returning to the original question of funding versus popularity: the topics that ranked higher than some of their counterparts did not necessarily show a corresponding dominance in funding. For instance, the favorite count of astrophysics was found to dominate all topics barring heliophysics. The funding break-up tells a different story, with space operations, human exploration and planetary science being the top three funded topics on our list (see the figure titled NASA Budget Estimates in the Topic Assignment section). In fact, space operations was ranked number one in funding, yet was found to have a lower or equal ranking of favorite and retweet counts with respect to its counterparts.
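If one wanted to quantify this (lack of) agreement rather than eyeball it, a Spearman rank correlation between the two orderings would be a natural follow-up. The sketch below shows the mechanics; the rank lists are placeholders for illustration, not the study's actual funding or popularity rankings:

from scipy.stats import spearmanr

topics = ['SpaceOperations', 'Exploration', 'PlanetaryScience',
          'Astrophysics', 'EarthScience', 'Heliophysics']
funding_rank    = [1, 2, 3, 4, 5, 6]   # hypothetical: 1 = most funded
popularity_rank = [4, 5, 2, 1, 6, 3]   # hypothetical: 1 = most favorited/retweeted

rho, p = spearmanr(funding_rank, popularity_rank)
print(rho, p)   # rho near zero would indicate no agreement between the rankings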

Conclusion

The philosophy behind someone hitting the favorite or retweet button is open to interpretation. A retweet is often attributed to the user supporting the idea conveyed by the tweet and wanting to propagate it to their network, whereas favoriting a tweet is seen in the same light as bookmarking a link to read later. Whatever the reasoning behind these actions, in this study both exhibited a similar pattern when grouped by topic: the topics that showed a higher probability of being 'favorited' than some of their peers also exhibited a higher probability of being retweeted than those same peers.

Science topics, particularly the science of distant worlds (astrophysics), captured the imagination of tweet readers the most, as they were retweeted and favorited more than most of their peers. However, attributing this dominance to the Twitterverse presuming the scientific merit of astrophysics to be higher would be inaccurate. In fact, drawing any conclusion about the reason for the apparent popularity of some topics over others is not possible without a study in which respondents' reasons for retweeting and 'favoriting' have been established a priori.

There was no similarity between the topic-wise rankings of funding and of popularity amongst the Twitterverse. However, an interesting trend did emerge: science topics dominated their engineering counterparts in favorite and retweet counts with 55%-67% probability.

Addendum:
  • The President’s budgetary office (OMB) reviews agency programs and revenue proposals and, based on the President’s policies, drafts the budget it deems necessary for each agency. This is sent to Congress for approval; Congress conducts its own scrutiny through its subcommittees and passes an edited version of the budget that it deems appropriate. Once the budget is passed, federal agencies receive the funding to carry out their activities for the fiscal year in question.
  • Current NASA funding is less than 0.5% of the total federal funding.
  • The topics human exploration and space operations are sometimes combined and referred to together as human exploration and operations.
  • Prior to baselining the current methodology, one approach I tried was to use Wikipedia articles on NASA and related topics to augment the NASA budget documents (the Wikipedia API lets you mine and play with Wikipedia data easily). I started by extracting the article corresponding to each topic, then drilled down into all the links on that page, and combed those pages for new links, and so on. This gave me thousands of lines of text belonging to each topic which, once cleaned, would have let me create a labeled dataset. However, the information extracted in this manner was very noisy, and filtering out non-pertinent pages was not possible without a lot of manual intervention. A sketch of this crawling idea appears after this list. You can see how to mine Wikipedia data and the related topics and information gathered from this method here. <Insert the link to wikimining>
  • Although the topic James Webb Telescope comes under the Science bracket for funding purposes, the telescope is currently under construction and assembly, and hence any tweets about it concern the progress of engineering activities.
  • Although the favorites and retweets were collected over a period of a year, the time period is assumed to have no bearing on the favorite and retweet counts. This assumption is justified since no major opinion-shifting event, such as humans going to the moon or the first flight of the space shuttle, occurred in the period under consideration; only a major discovery that redefines humankind’s perspective of space would be a cause for concern. I also assume that the Twitterverse demographics engaging with NASA posts did not change significantly over the course of the year.
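As promised above, here is a minimal sketch of the Wikipedia-crawling idea, using the third-party wikipedia package (pip install wikipedia). The topic name, the one-level crawl depth and the link cap are illustrative choices, not the exact original setup:

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def collect_topic_text(seed_title, max_links=10):
    """Fetch a seed article plus a handful of the articles it links to."""
    texts = []
    try:
        seed = wikipedia.page(seed_title)
    except (DisambiguationError, PageError):
        return texts
    texts.append(seed.content)
    for title in seed.links[:max_links]:      # drill one level into the links
        try:
            texts.append(wikipedia.page(title).content)
        except (DisambiguationError, PageError):
            continue                          # skip pages that fail to resolve
    return texts

corpus = collect_topic_text('Astrophysics')   # raw text labeled with this topic
print(len(corpus), 'documents collected')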

 

 

 
