Tuesday, March 7, 2017

Text Mining the Bible

Introduction

What is the Bible about?  There are many ways one could answer this question, and even more ways one could go about finding an answer to the question.  One could read the Bible or read books about the Bible.  One could look at the scriptural text, the theological formulations, the historical narrative, or adopt another lens by which to make sense of what the book as a whole is all about.

In what follows, I offer a data science approach to begin to answer this question through the use of keyword relevance analysis and data visualization.  Such an analysis can provide quick insights into the key words and themes that can then be understood visually.  While extremely simplistic and nowhere near as rich, complete, or meaningful as actually reading the Bible or reading books on the Bible, such an analysis can provide guidance as to what is important.  I detail the method and show the results of this analysis in what follows.

The Method

I wrote a Python script that scraped the Biblical text book by book, chapter by chapter from Biblegateway.com.  I selected the NRSV translation and used the Catholic selection and ordering of books in the Bible.  Once the book text was parsed and cleaned (ok, not perfectly, but good enough for demonstration purposes), the whole book was submitted to the Alchemy API for keyword relevance analysis.  The output of the API was a keyword (or key phrase) (e.g., "God") with the relevance or importance of the keyword within the context of the book (e.g., 0.95).  The higher the relevance, the more important the keyword was.  After compiling the results, I had a list of about 50 keywords by relevance for each book of the Bible.

I took the results and brought them into Tableau for interactive filtering and visualization.  Screenshots below of the results are taken from the resulting workbook.
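The compile step described above might look something like the following. This is a minimal sketch, not the original script: the function names (`flatten_keywords`, `rows_to_csv`), the input shape, and the sample values are all assumptions made for illustration. It flattens per-book keyword results into (book, keyword, relevance) rows and writes them as CSV text, a format Tableau can read.

```python
import csv
import io


def flatten_keywords(results_by_book):
    """Turn {book: [(keyword, relevance), ...]} into (book, keyword, relevance) rows."""
    rows = []
    for book, keywords in results_by_book.items():
        for keyword, relevance in keywords:
            rows.append((book, keyword, float(relevance)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["book", "keyword", "relevance"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up values for illustration only; these are not the actual API results.
sample = {
    "Genesis": [("God", 0.95), ("land", 0.60)],
    "Exodus": [("Lord", 0.90)],
}
rows = flatten_keywords(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping one row per (book, keyword) pair makes the later counting and filtering steps simple aggregations over the same table.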

Here are some links for those who may be interested regarding the nuts and bolts of the analysis:
Here are some high level results of the analysis with some simple conclusions.

Results: Most Important Keywords

So what is the Bible about?  From the above method, we can look at keywords and phrases by total counts and relevance.  We'll look at these in turn.

Total Counts

What are the most common keywords?  With 73 books in this Biblical list, a keyword could have a total count of 73 if it was a keyword in every book of the Bible.  According to the Alchemy API, the top 10 keywords by total count in the Bible are:
  1. God (48)
  2. Lord (46)
  3. people (44)
  4. house (26)
  5. things (22)
  6. Lord God (22)
  7. Israel (22)
  8. son (21)
  9. land (20)
  10. king (19)
Not surprisingly, the Bible is about "God" or the "Lord".  It is also about a "people", namely, the Israelites, and the story of their establishment in the "land" of Israel.  Much of the Old Testament describes the rise and fall of various kings (hence "king") and the establishment of the "house" of the Lord (i.e., the temple), or membership in or descent from a "house" (e.g., the house of David).  As lineage is important, "son" is used often to explain ancestry, and in the New Testament this word takes on a special meaning as part of the phrase "son of God".
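For readers who want to reproduce a total count like this, here is a minimal sketch in Python. It assumes the results are held as (book, keyword, relevance) rows; the function name `total_counts` and the sample rows are hypothetical, and the values are made up rather than taken from the real results. The total count of a keyword is simply the number of distinct books that list it.

```python
from collections import Counter


def total_counts(rows):
    """Count, for each keyword, the number of distinct books that list it."""
    # Deduplicate (book, keyword) pairs so a keyword counts once per book.
    seen = {(book, keyword) for book, keyword, _ in rows}
    return Counter(keyword for _, keyword in seen)


# Made-up rows for illustration only.
rows = [
    ("Genesis", "God", 0.95),
    ("Genesis", "land", 0.60),
    ("Exodus", "God", 0.80),
    ("Exodus", "Lord", 0.90),
    ("Psalms", "Lord", 0.85),
    ("Psalms", "God", 0.70),
]
counts = total_counts(rows)
top = counts.most_common(1)
print(top)
```

With 73 books in the list, the largest possible value for any keyword under this definition is 73.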

Here are the top 3 keywords (God, Lord, people) over the range of the Bible.  While not categorized as relevant in every book of the Bible, "God"/"Lord" is very relevant throughout most of the Bible.  The word "people", though less relevant on average, is also found throughout the Bible.  This would suggest that the Bible is fundamentally about "God" and the relationship of "God" to a "people".




Noticeably absent from this list is "Jesus", whose top entry is "Lord Jesus Christ", tied at #12 with a total count of 15.

Maximum Relevance

What are the most relevant keywords?  A keyword may only occur a few times but be extremely relevant in context, while another word may occur more often but may not be relevant at all.  So while total count does give an indication of relevance, it is not exhaustive.  According to this Quora post, the Alchemy API calculates relevance by using word position, context of other words, how many times it is used, and other statistics.  However, the specific details are not documented anywhere that I could find.

What are the most relevant keywords/phrases?  I looked at the maximum relevance for a keyword in the whole Bible.  In order of most relevant, the top 10 keywords/phrases are:
  1. Lord Your God (0.998)
  2. God (0.996)
  3. beloved speaks (0.992)
  4. son (0.990)
  5. savior Jesus Christ (0.989)
  6. shall (0.989)
  7. Lord (0.988)
  8. Judas (0.988)
  9. Jesus Christ (0.984)
  10. Holy Spirit (0.984)
Again, we can infer that the Bible is about "God", or more specifically (as addressed to the Israelites) the "Lord Your God".  "Beloved speaks" is interesting in that it comes from the Song of Solomon, in which Solomon writes poetry to his "beloved".  We also see again the importance of "son".  But now we also see "Jesus Christ" enter into the picture as important, as is the "Holy Spirit".  Thus, each person of the Trinity is considered relevant.

The occurrence of "Judas" is also very interesting.  However, it is not a reference to the Judas who betrayed Jesus, but a reference to "Judas Maccabeus" from the book of 1 Maccabees.  The word "shall" comes from the book of Micah, who is a prophet and speaks of many things that "shall" happen.
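The maximum relevance ranking, along with the book each maximum comes from, can be sketched as follows. This is an illustrative sketch under the same assumed (book, keyword, relevance) row layout; `max_relevance` and `rank_by_max_relevance` are hypothetical names and the sample values are made up.

```python
def max_relevance(rows):
    """Map each keyword to (highest relevance, book where it occurs)."""
    best = {}
    for book, keyword, relevance in rows:
        if keyword not in best or relevance > best[keyword][0]:
            best[keyword] = (relevance, book)
    return best


def rank_by_max_relevance(rows, n=10):
    """Return the n keywords with the highest maximum relevance, best first."""
    best = max_relevance(rows)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return ranked[:n]


# Made-up rows for illustration only.
rows = [
    ("Genesis", "God", 0.95),
    ("Exodus", "God", 0.80),
    ("Exodus", "Lord", 0.90),
    ("Psalms", "Lord", 0.97),
    ("Genesis", "land", 0.60),
]
ranked = rank_by_max_relevance(rows, n=2)
print(ranked)
```

Tracking the source book alongside the score is what makes it possible to trace a surprising keyword back to where it came from.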

Here are the top 5 (Lord Your God, God, beloved speaks, son, savior Jesus Christ).  We have seen that "God" is used throughout the Bible.  "Lord Your God" is used primarily in the early part of the Old Testament and then again in the prophets.  The word "son" is used heavily in the historical books of the Old Testament, and then again in the 4 gospels: Matthew, Mark, Luke, and John.  "Savior Jesus Christ" only makes one appearance in 2 Peter.




Total Counts and Maximum Relevance

We have seen that some words have high counts but lack relevance, while others have relevance but lack high counts.  Which keywords have both?  I increased the Total Counts filter and the Max Relevance filter until I had 10 keywords.  Choosing different thresholds would result in a slightly different combination, but my selection requires each keyword to have a max relevance above 0.90 and a Total Count greater than 12.    This results in the following list:
  • Father
  • God
  • Holy Spirit
  • Israel
  • Jesus Christ
  • King
  • Lord
  • Lord God
  • Lord Your God
  • Son
We have seen most of these already.  The exception is "Father".  It is a keyword 13 times and has a max relevance of 0.933.  It occurs throughout the Bible, but reaches a high point in Matthew.   Matthew opens up with a genealogy of Jesus in the form of "X was the father of Y, and Y the father of Z, and Z...", and this goes on for many lines. 
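Outside of Tableau, the same two-threshold filter could be approximated in a few lines of Python. This is a sketch under assumptions: the (book, keyword, relevance) row layout, the names `summarize` and `filter_keywords`, and the sample values are all hypothetical. The thresholds are parameters, so the 0.90 and 12 used above would be passed as `min_relevance=0.90` and `min_count=12`.

```python
from collections import defaultdict


def summarize(rows):
    """Map each keyword to (number of distinct books, maximum relevance)."""
    books = defaultdict(set)
    best = defaultdict(float)
    for book, keyword, relevance in rows:
        books[keyword].add(book)
        best[keyword] = max(best[keyword], relevance)
    return {keyword: (len(books[keyword]), best[keyword]) for keyword in books}


def filter_keywords(rows, min_count, min_relevance):
    """Keep keywords whose count and max relevance both exceed the thresholds."""
    summary = summarize(rows)
    return sorted(
        keyword
        for keyword, (count, relevance) in summary.items()
        if count > min_count and relevance > min_relevance
    )


# Made-up rows for illustration only; small thresholds suit the tiny sample.
rows = [
    ("Genesis", "God", 0.95),
    ("Exodus", "God", 0.80),
    ("Exodus", "Lord", 0.85),
    ("Psalms", "Lord", 0.88),
    ("Genesis", "land", 0.97),
]
selected = filter_keywords(rows, min_count=1, min_relevance=0.90)
print(selected)
```

In this toy sample, "Lord" fails the relevance threshold and "land" fails the count threshold, so only "God" survives both.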

Most Relevant Keyword In Book Total Counts

Which words are consistently the most relevant in a book?  If we find the most relevant word in a book and then find the total counts of these words, which ones are on top?  The top 8 (those occurring more than once) are:

  1. Lord (17)
  2. God (14)
  3. Son (4)
  4. Christ (4)
  5. King (4)
  6. Christ Jesus (3)
  7. Jesus Christ (3)
  8. Jesus (2)
Again, "Lord" and "God" are consistently the most important.  We have also seen "son" and "king" before, as well as "Jesus".  Of note is that "Christ" now appears on its own in addition to being attached to Jesus as we have seen before.  The word "Christ" means "Messiah" who is the awaited savior and king of the Jews.  So there is the notion that Jesus is the "king" and savior of the Jewish people.
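The "most relevant keyword per book" tally can be sketched like this. As before, this is an illustration and not the original workflow: it assumes (book, keyword, relevance) rows, the function names are hypothetical, and the sample values are made up.

```python
from collections import Counter


def top_keyword_per_book(rows):
    """Map each book to its single most relevant keyword."""
    best = {}
    for book, keyword, relevance in rows:
        if book not in best or relevance > best[book][1]:
            best[book] = (keyword, relevance)
    return {book: keyword for book, (keyword, _) in best.items()}


def count_top_keywords(rows):
    """Count how many books each keyword tops."""
    return Counter(top_keyword_per_book(rows).values())


# Made-up rows for illustration only.
rows = [
    ("Genesis", "God", 0.95),
    ("Genesis", "land", 0.60),
    ("Exodus", "Lord", 0.90),
    ("Exodus", "God", 0.80),
    ("Psalms", "Lord", 0.85),
]
top_counts = count_top_keywords(rows)
print(top_counts.most_common())
```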

Keyword Category Total Counts

Many of these keywords could be further categorized or similarly grouped.  For example, "Jesus Christ" and "Christ Jesus" could be grouped with "Jesus".  After performing such groupings, what are the total counts now?  The top 10 are:
  1. God (228)
  2. Lord (150)
  3. Family (149)
  4. People/Israel (146)
  5. Jesus (117)
  6. Virtue (102)
  7. Covenant/Law (61)
  8. Body Part (58)
  9. Land/Earth (56)
  10. Humans (51)

"God" and "Lord" are still at the top, but next we have familial terms like "father", "son", "brother", "sister", and "child".  Then we have terms related to "Israel" or "People".  Next, any terms related to "Jesus".  After that, terms of virtue: peace, love, mercy, faith, hope.  Covenant/Law refers to terms related to the Old Testament law: priests, offering, covenant, and law.  Body Part refers to a mention of a body part like hands, heart, and eyes.  Land/Earth refers to the land, Earth, world, hill country, or some other generic description of location.  Lastly, humans refers to any non-familial term for humans: woman, man, and young men.

So we can see that this grouping still characterizes the Bible as about God, but perhaps it is about a relationship to people spoken of in largely familial concepts.  At the very least, family relationships are important and spoken of often.  Jesus is important too (more on that below).  A life of virtue in reference to (or perhaps in contrast with) the law/covenant of Israel is how one ought to live.  Such a vision has a broad scope, applying to the land and the Earth and humans of every kind on the Earth.
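A grouping like this can be expressed as a lookup table from keyword to category. The sketch below is an assumption-laden illustration: the mapping contains only a handful of the example terms mentioned above, the "Other" fallback is my own addition, and the sample rows are made up. Since the category totals above exceed the number of books, the sketch assumes each keyword row is counted once, so a category can be counted several times within one book.

```python
from collections import Counter

# Partial, illustrative mapping; the real grouping covered many more keywords.
CATEGORY_BY_KEYWORD = {
    "jesus": "Jesus",
    "jesus christ": "Jesus",
    "christ jesus": "Jesus",
    "father": "Family",
    "son": "Family",
    "brother": "Family",
    "peace": "Virtue",
    "love": "Virtue",
    "mercy": "Virtue",
}


def categorize(keyword):
    """Look up a keyword's category, ignoring case; unknown keywords are 'Other'."""
    return CATEGORY_BY_KEYWORD.get(keyword.lower(), "Other")


def category_counts(rows):
    """Count keyword rows per category."""
    return Counter(categorize(keyword) for _, keyword, _ in rows)


# Made-up rows for illustration only.
rows = [
    ("Matthew", "Jesus Christ", 0.90),
    ("Romans", "Christ Jesus", 0.80),
    ("Genesis", "son", 0.70),
    ("Genesis", "father", 0.60),
    ("Psalms", "mercy", 0.50),
    ("Genesis", "flood", 0.40),
]
grouped = category_counts(rows)
print(grouped)
```

An exact-match table is easy to audit, which matters here because the choice of categories is a judgment call that shapes the conclusions.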

Specific Comparisons

Other specific comparisons could be made of keywords and categories, but here are two interesting ones I saw.

Trinity: God/Lord, Jesus, Spirit/Holy Spirit

As we know, "God"/"Lord" is important throughout the Bible.  Once entering the New Testament, "Jesus" becomes equally important.  "Holy Spirit" has lesser importance, but is still prominent in the New Testament.  It is highest in the book of Acts, where it is more relevant than either "Jesus" or "God".



Virtue, Covenant/Law, Sin

The categories of Covenant/Law and Virtue seem to trade off in relevance.  In the beginning of the Bible, Covenant/Law keywords are most relevant.  In the prophetic books, Virtue keywords are more relevant.  In the beginning of the New Testament, both are important, but Covenant/Law is emphasized over Virtue. For the rest of the New Testament, however, Virtue is most important.  "Sin" appears here and there, but is only ever about as relevant as Virtue.


Other Insights?

While I could go on and on with more insights, I will leave that to you!  I have embedded the Tableau dashboard below.  You can also go here to view and use the dashboard.


Conclusion

So what is the Bible about?  From this simplistic keyword and relevance analysis, we can sum up the above findings and say that the Bible is broadly about:
  • God (the Lord)
  • The history of God's interaction with the people of Israel through the use of law and a covenant.
  • The coming of Jesus, who is considered to be the Christ and king of the Jewish people, followed by the coming of the Holy Spirit.
  • A shift towards a life focused on living by virtue instead of living (merely) by the law.
Much more can obviously be said here, but I think the main goal has been achieved.  That is, I have shown that a keyword and relevance analysis using data science methods (e.g., web scraping, APIs, data visualization) can reveal much about the important themes in a text, in particular, the Bible.