InfoPython - Measuring the Value of Mass Media Information with Python.

Author: Juan Bautista Cabral

JBC met Python on a lonely night in 2007. He developed his Systems Engineering degree project in this language using the Django framework, and worked for a year developing information evaluators with our lovely reptile.

blog: http://jbcabral.wordpress.com

mail: jbc.develop@gmail.com

Infopython is a library for evaluating media information using formal theories from the social sciences. It started as a scattered set of modules used in my work; after a few days of refactoring and patience I managed to unify them into a single library.

Background

Different sociological theories try to determine the importance of the media for public opinion; they analyze the information the media emit from the viewpoint of the sender, the receiver, or both.

To cite one example, Shannon's Information Theory is a mathematical formalization of a sociological theory known as the Hypodermic Needle.

In the case of Infopython, the theory under consideration is known as Agenda-Setting (others are planned for the future). It posits that the mass media have great influence on the public by determining which stories are newsworthy and how much space and importance they are given. The focus of this theory is that assigning a story a higher priority yields a larger audience, a greater impact and a certain awareness of the news.

The Media

Before describing how Infopython calculates the value of information, we need to determine what it is that we are measuring.

For our domain, we informally define an information medium as:

An emitting element whose information value we want to measure. It must be homogeneous, give a "feeling" of unity, and be measurable.

Where:

  • Homogeneous: All information issued by the medium must have common characteristics. Otherwise, its extreme internal variation makes it impossible to measure.

    For example:

    a television channel does not qualify as a "medium" in our case, since each program and time slot has very different audience levels and content; here it is better to take each TV show as the medium to measure.

  • Sense of unity: This concept is easier to grasp from an example:

    If I say that my information medium is "Sports magazines", it does not feel like "one medium"; yet when I change the definition to "Goals and Lanterns sports magazine", that feeling is present.

  • Measurable: If we cannot extract quantitative data from a medium we have defined, it is of no use to us.

Formalizing

Now that we have defined what a medium is for our domain, we can define "mathematically" a model that fits the earlier description of Agenda-Setting, and propose the following:

The value of the information from a medium is a function of its audience and impact, discarding awareness because it is difficult or impossible to measure.

Or, equivalently:

VALUE = F(AUDIENCE, IMPACT)

Where:

  • VALUE: The importance of the medium according to the theory.
  • AUDIENCE: How many people the information of the medium reaches.
  • IMPACT: How important the medium is to its audience.

We now propose the following as F:

F(AUDIENCE, IMPACT) = AUDIENCE * IMPACT

The function '*' (multiplication) is chosen for two reasons:

  • It reflects changing values better: if a parameter increases or decreases a lot, the value varies accordingly.

    Let's assume the following case:

    A medium is followed by a small audience of 10 but has a high impact of 1000. This can occur if those few people belong to an influential group (presidential advisors, for example).

    In this case the value of the information would be 10,000, and since it is up to us to define which values count as "much" and which as "little", we take this value to be "large". We can consider this correct, since any medium that can influence important people should have a high value.

    Keeping the same values but changing our function to AUDIENCE + IMPACT, the value of the information would be 1010, which by the same reasoning is still high.

    Now, if we replace the audience with a large number, 1000, and keep the impact, the value is 1,000,000 with the original function and 2000 with the second.

    Considering that we now have a high impact on 1000 presidential advisors, a value of 2000 is still small, since we are in the presence of a medium likely to generate a global impact. This shows that multiplication represents the variation of the parameters much better.

  • The value vanishes in the absence of audience or impact: when the audience or the impact is 0 (zero), that is, no one sees or pays attention to the medium, the value of the information is also 0 (zero).

    This is not trivial, because it means that information is worthless if nobody sees it or nobody pays attention to it.
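
The arithmetic of this comparison can be checked with a few lines of Python. This is only a sketch of the numbers used in the text; the function names value_product and value_sum are illustrative and are not part of Infopython.

```python
def value_product(audience, impact):
    """Value as proposed in the text: AUDIENCE * IMPACT."""
    return audience * impact


def value_sum(audience, impact):
    """Alternative discussed in the text: AUDIENCE + IMPACT."""
    return audience + impact


# (audience, impact) pairs taken from the cases discussed in the text,
# plus one with no audience at all.
cases = [(10, 1000), (1000, 1000), (0, 1000)]

for audience, impact in cases:
    print(audience, impact,
          value_product(audience, impact),
          value_sum(audience, impact))
```

With the sum, a medium that nobody sees would still keep a value equal to its impact; with the product it drops to zero, which is the second property described above.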

Infopython

A wide variety of public services extract statistics and data on new media (web, Twitter, etc.). Infopython focuses on providing a simple API for valuing media of any type through Agenda-Setting (other theories may be implemented in the future), using those services.

Architecture:

[Figure: Infopython architecture diagram]

Analyzing each layer:

  • Internet Services: The different services available on the Web for statistics and data mining of new media.

  • Other Sources: Other data that feed Infopython, such as databases, Excel templates, etc.

  • Scipy: An open source library of algorithms and mathematical tools. It handles the necessary number crunching.

  • Third-Party APIs: Third-party libraries that connect to services on the network. For example:

    • Tweepy, used to manipulate data from Twitter.
    • Koutpy, which connects to Klout.

  • Session: This sub-layer is a module responsible for centralizing all the settings needed to access Internet services.

  • Interpolation Normalization: An abstraction layer over the different interpolators that scipy provides; it also defines some new ones, all with the same API.

  • API Normalization: Converts the responses of all Internet services and third-party APIs to common structures (dictionaries), using the information contained in the session if needed.

  • Information Sources: The classes that represent our sources. They are connected in an "auto-magical" way to the various normalized APIs.

  • Theories: This layer has the modules that define the behavior and calculations of the theories implemented in Infopython (only Agenda-Setting in the current version). Each theory encapsulates the information media in "nodes", which add the data provided by that theory.

[Figure: Infopython nodes diagram]
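
To make the API Normalization idea more concrete, here is a minimal sketch that maps two differently shaped responses to one common dictionary. The response layouts, the service names and every function name are invented for the illustration; they do not reflect the real services or Infopython's internals.

```python
def normalize_service_a(response):
    """Hypothetical service A nests its figures under 'data'."""
    return {"audience": response["data"]["visitors"]}


def normalize_service_b(response):
    """Hypothetical service B returns a flat list of (name, value) pairs."""
    return {"audience": dict(response)["followers"]}


# One registry and one calling convention for every service.
NORMALIZERS = {
    "service_a": normalize_service_a,
    "service_b": normalize_service_b,
}


def normalize(service_name, response):
    """Convert a raw response into the common dictionary structure."""
    return NORMALIZERS[service_name](response)


# Made-up raw responses.
raw_a = {"data": {"visitors": 1200}}
raw_b = [("followers", 340), ("following", 25)]

print(normalize("service_a", raw_a))
print(normalize("service_b", raw_b))
```

The point of such a layer is that the code above it only ever sees dictionaries with the same keys, whatever the service behind them.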

With the theory and the whole architecture defined, we can describe how to work with the library:

  1. Set up the session: Provide the session layer with all the required API keys (the authentication mechanisms of the third-party services).

Example:

from infopython import session

# List all MANDATORY keys of the library
session.NEEDED_KEYS

# Set up the session with the keys v0, v1, ...
session.set(v0=1, v1=2)  # ... and so on for the remaining keys

# Return the value of a key
session.get("v0")

# Delete the session
session.clear()

In the current version all the NEEDED_KEYS are mandatory and the session is immutable.

  2. Create the media: Create the media whose value you want to check. This version of Infopython provides classes for 2 media:

    • WebPages: Represents a web page, whether it is a Twitter profile, a blog, or anything else. The suggested mechanisms for measuring its audience are the services of Compete (http://www.compete.com/) or Alexa (http://www.alexa.com/).

      The suggested mechanism for measuring impact is PageRank (http://es.wikipedia.org/wiki/PageRank), because if Google says this is the importance of the information, we are not going to argue with Google.

      WebPage API example:

      from infopython.isources import webpages

      google = webpages.WebPage("google.com")

      google.id    # returns "google.com"
      google.url   # returns "http://google.com"
      google.html  # the HTML content of "http://google.com"
      google.text  # the text of the HTML of "http://google.com"

      google.get_info("compete")  # Compete information for
                                  # "google.com", using the
                                  # Compete key provided
                                  # in the session
      
      
    • TwitterUser: Represents a Twitter user, not their tweets.

      The suggested mechanism for measuring audience is the number of followers, and for impact the information provided by Klout (http://klout.com/).

      TwitterUser API example:

      from infopython.isources import twitteruser

      yo = twitteruser.TwitterUser("leliel12")
      yo.id        # "leliel12"
      yo.username  # "leliel12"
      yo.get_info("tweepy")  # tweepy information of the user
                             # "leliel12", using the Twitter
                             # key provided in the session
      
      
  3. Create evaluators: This involves creating callables (functions or methods) that receive an information medium as a parameter and return the values to be used as audience or impact. For example, if we decide that our WebPage isource takes its audience from Compete and its impact from PageRank, the functions should be similar to these:

# Extract the Compete unique visitors of the WebPage received as
# parameter
aud = lambda w: w.get_info("compete")["metrics"]["uv_count"]

# Extract the PageRank value of the WebPage received as parameter
imp = lambda w: w.get_info("pagerank")["pagerank"]

If an evaluator is not supplied to the agenda, it will try to use the supplied interpolators instead.
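
To try evaluators like these without any API key, they can be run against a stand-in object that returns canned data. In the sketch below the FakeWebPage class and all the figures are made up; only the shape of the dictionaries follows the lambdas shown above.

```python
class FakeWebPage(object):
    """Stand-in for an information source; it returns canned data."""

    def __init__(self, url, info):
        self.url = url
        self._info = info

    def get_info(self, service):
        # Return the canned response registered for this service name.
        return self._info[service]


# Same evaluators as in the text.
aud = lambda w: w.get_info("compete")["metrics"]["uv_count"]
imp = lambda w: w.get_info("pagerank")["pagerank"]

page = FakeWebPage("http://example.com", {
    "compete": {"metrics": {"uv_count": 5000}},  # made-up figure
    "pagerank": {"pagerank": 6},                 # made-up figure
})

# Audience, impact and their product (the value).
print(aud(page), imp(page), aud(page) * imp(page))
```

Since an evaluator is just a callable that takes a medium, any function with that shape can be swapped in, for example one that reads figures from a spreadsheet.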

  4. Create interpolators: Interpolators are the second alternative for obtaining audience and impact, so each agenda receives 2 interpolators: one for audience and one for impact.

Thus the impact interpolator receives the audience as the value "X" to interpolate and returns a value "Y" for the impact.

Conversely, if we wish to interpolate the audience, the interpolator receives the impact as "X" and returns a "Y" for the audience.
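
As a rough illustration of what an impact interpolator does, the sketch below estimates the impact from the audience by linear interpolation between known points. It uses plain Python instead of scipy to stay self-contained, and it is not Infopython's interpolation layer: the function make_linear_interpolator and the observations are invented for the example.

```python
def make_linear_interpolator(xs, ys):
    """Return f(x) that interpolates linearly between known (x, y) points.

    xs must be strictly increasing, and x must lie within [xs[0], xs[-1]].
    """
    def interpolate(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the known range")
        # Find the first segment whose right end reaches x.
        for i in range(1, len(xs)):
            if x <= xs[i]:
                x0, x1 = xs[i - 1], xs[i]
                y0, y1 = ys[i - 1], ys[i]
                return y0 + (y1 - y0) * float(x - x0) / (x1 - x0)
    return interpolate


# Made-up observations: audience ("X") -> impact ("Y").
known_audience = [100, 1000, 10000]
known_impact = [2, 4, 8]

impact_interpolator = make_linear_interpolator(known_audience, known_impact)
print(impact_interpolator(550))
```

An audience interpolator would be built the same way with the two lists swapped, taking the impact as "X" and returning the audience as "Y".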

A combined example is shown below.

  5. Create the agenda(s): When creating an agenda, you should provide it with the following data:
  • The kind of information medium it will measure.
  • A list of media to be measured (optional).
  • An audience data extractor (optional).
  • An impact data extractor (optional).
  • An audience interpolator (optional).
  • An impact interpolator (optional).

A combined example is shown below.

  6. Evaluate nodes: The agenda has methods to sort ISources by value; it only needs to be iterated to generate a ranking of importance of the media.

Iterating over the agenda returns ASNode objects, data structures that encapsulate the media and add attributes for Audience, Impact and Value, as well as the date and time when the node was created.
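
The ranking idea can be sketched in plain Python. This is not the library's own example and it does not use Infopython's classes: Node, wrap and rank are hypothetical names, the figures are made up, and sorting from highest to lowest value is an assumption.

```python
import collections
import datetime

# Hypothetical node: the medium plus the data the theory adds to it.
Node = collections.namedtuple(
    "Node", ["medium", "audience", "impact", "value", "created"])


def wrap(medium, audience_of, impact_of):
    """Build a node for one medium using the two evaluators."""
    audience = audience_of(medium)
    impact = impact_of(medium)
    return Node(medium, audience, impact, audience * impact,
                datetime.datetime.now())


def rank(media, audience_of, impact_of):
    """Return the nodes sorted from highest to lowest value."""
    nodes = [wrap(m, audience_of, impact_of) for m in media]
    return sorted(nodes, key=lambda n: n.value, reverse=True)


# Made-up figures: medium name -> (audience, impact).
figures = {"site-a": (1000, 3), "site-b": (10, 1000), "site-c": (500, 0)}

ranking = rank(figures,
               audience_of=lambda m: figures[m][0],
               impact_of=lambda m: figures[m][1])
for node in ranking:
    print(node.medium, node.value)
```

Note how the small but influential "site-b" ends up first and the medium with zero impact ends up last, as the multiplication model predicts.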

More Methods of Agenda

Assuming we have an agenda instance ag, the same one from the previous example, and a WebPage, google:

ag.value_of(google)     # returns the value of google (audience * impact)
ag.impact_of(google)    # returns the impact of google.
                        # Given the impact evaluator we defined,
                        # this makes the call:
                        # google.get_info("pagerank")["pagerank"]

ag.audience_of(google)  # returns the audience of google.
                        # Given the audience evaluator we defined,
                        # this makes the call:
                        # google.get_info("compete")["metrics"]["uv_count"]

ag.wrap(google)    # returns an ASNode with the audience, impact
                   # and information value of google

ag.count(google)   # returns how many times this medium is in the agenda

ag.remove(google)  # removes the first occurrence of google from the agenda

ag.append(google)  # adds google to the agenda

ag.for_type  # the type of isource this agenda was created for;
             # WebPage in our example

ag.audience_valuator      # None or the function that calculates the audience

ag.impact_valuator        # None or the function that calculates the impact

ag.audience_interpolator  # None or the audience interpolator

ag.impact_interpolator    # None or the impact interpolator

Comparing 2 Agendas

The agenda module has a function that is useful for evaluating several agendas with different media together.

This function returns a sorted list of the ASNode objects of both agendas.

from infopython import agenda
from infopython.isources import webpages, twitteruser

# 2 agendas with different media types.
AG1 = agenda.AgendaSetting(iSource=webpages.WebPage)
AG2 = agenda.AgendaSetting(iSource=twitteruser.TwitterUser)

# Iterate over all the media of both agendas,
# sorted by 'value'.
for i in agenda.rank_isources(AG1, AG2):
    print(i)
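
Conceptually, ranking several agendas together means chaining their nodes and sorting the result by value. The sketch below shows that idea with plain lists of dictionaries; rank_together, the media names and the figures are made up, and the descending order is an assumption, so it should not be read as the implementation of rank_isources.

```python
import itertools


def rank_together(*agendas):
    """Chain the nodes of every agenda and sort them by value."""
    nodes = itertools.chain.from_iterable(agendas)
    return sorted(nodes, key=lambda node: node["value"], reverse=True)


# Made-up nodes; each agenda is just a list of dictionaries here.
web_agenda = [{"medium": "example.com", "value": 3000}]
twitter_agenda = [{"medium": "@someone", "value": 10000},
                  {"medium": "@other", "value": 40}]

for node in rank_together(web_agenda, twitter_agenda):
    print(node["medium"], node["value"])
```

Because every node carries the same "value" attribute, media of different types can be compared on a single scale.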

Final Note: Test

After downloading the library, the first thing to do is run the tests, following these steps:

  1. Run

    $ python setup.py test

  2. Configure test.cfg with the keys of the corresponding APIs.

  3. Run the tests again:

    $ python setup.py test

Conclusion

As we saw, Infopython provides a uniform way of assessing information. Future versions plan to introduce other types of mass media since, for example, IMDB and Google Books provide information on traditional media (movies and books) via APIs or, going further, LinkedIn provides fairly reliable information about job profiles.

Integration with natural language processing (NLTK) or a semantic web tool is also possible.


Last Change: Sat Jul 9 15:00:35 2011.  -  This magazine is under a Creative Commons license