Introduction and Background
Before you run away just because of the title, I’m not a crazy religious person!
(**Please note:** I’m not saying religious people are crazy, I’m just saying I’m not a religious person who IS crazy.)
Okay, with that said: as promised (I think I’ve mentioned it in previous posts and the newsletter), I’m trying to get more into machine learning, namely neural networks, and even more specifically deep learning for Natural Language Processing, as there is a lot of literature on neural networks in the direction of NLP and a lot of progress has been made in the space in recent years.
While I’m learning how to develop machine learning solutions, I’m not totally convinced that the answer is always “just throw it into a neural network!” I think that rule-based processing and decision trees that then lead to neural networks (or perhaps a decision tree with no neural network at all) can be the best solution, especially when the decision tree or rules can route to an optimized neural network based on the task/question/command issued, at least when it comes to natural language processing.
My search for more powerful NLP sentence generation began with my other post about creating a code assistant, when I realized rule-based printing could only go so far, and that printing static blocks of code was especially difficult (and annoying) to code.
So I began hunting for a way to print blocks of code based on a command input - a word- or character-level RNN can accomplish this, but since it was my first time working with machine learning and neural networks, I decided to start with something very simple to get used to the workflow. This example, in the end, is just a sampling method, and no querying of specific structures is possible. However, I intend to use what I learned here to revisit and greatly improve my code assistant.
So, as my very first real use of ML in the NLP space (or ML at all, for that matter!), I decided on a neat character-based recurrent neural network (RNN) that I originally found via Andrej Karpathy’s article on the ‘unreasonable effectiveness’ of recurrent neural networks. [He supplies a link to the original GitHub code here.](https://github.com/karpathy/char-rnn) After some inspection, and after trying to install all the libraries myself: Torch (successful), Lua (successful), and HDF5 (for some reason a giant fail :sad:), I ultimately ended up running Docker images for the code.
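To make the “character-level” idea concrete, here’s a minimal sketch in Python/NumPy (not the actual Lua/Torch code from char-rnn, and with random placeholder weights instead of a trained model) of the sampling loop: the network reads one character at a time, keeps a hidden state, and emits a probability distribution over the next character, which gets fed back in as the next input.

```python
import numpy as np

text = "In the beginning God created the heaven and the earth."
chars = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

vocab_size = len(chars)
hidden_size = 64  # hypothetical size; my actual runs used 512 or 1024 per layer

# random parameters standing in for a trained model
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(vocab_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def sample(seed_char, n):
    """Generate n characters, one at a time, feeding each sampled character back in."""
    x = np.zeros((vocab_size, 1))
    x[char_to_ix[seed_char]] = 1       # one-hot encode the seed character
    h = np.zeros((hidden_size, 1))     # hidden state carried across steps
    out = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # recurrent update
        y = Why @ h + by
        p = np.exp(y) / np.sum(np.exp(y))    # softmax over the character vocabulary
        ix = np.random.choice(vocab_size, p=p.ravel())
        out.append(ix_to_char[ix])
        x = np.zeros((vocab_size, 1))
        x[ix] = 1                            # next input is whatever we just sampled
    return "".join(out)

print(sample("I", 80))  # gibberish with random weights, but shows the mechanics
```

With trained weights, this same loop is what produces the samples you’ll see below; the mechanics mirror Karpathy’s min-char-rnn.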
For my post here, I decided the least I could do was modify the corpus and share my findings (yes, sorry! - I didn’t write the RNN algorithm myself; if you want to get into the technicals, Karpathy also has a great post titled Hacker’s Guide to Neural Networks). I want to be very clear: I’d say about 90% of the credit for the neat content found in this post goes to Karpathy and Johnson for creating the code; I simply modified some of the network parameters and built a bible corpus 1.
Data Preprocessing
The dataset comes from Kaggle, and was in a very nice CSV format. I’m going to leverage pandas’ CSV import again to clean the data and write it out to a single file (if you’ve been reading through the posts here on NLP Champs, I used the same technique for the South Park Markov Chains project). I decided to keep the ‘BOOK’ / ‘CHAPTER’ names - that way it would be a good test to see if the neural network truly extracts the ‘features’ of the bible corpus, the book and chapter headers being some of those features.
import pandas as pd

df = pd.read_csv('./t_kjv.csv')  # read in the data from Kaggle

file = open('bible-text.txt', 'w')  # open a file to write to

# for tracking the current book / chapter
iCurrBook = 1
iCurrChapter = 1

file.write('BOOK ' + str(iCurrBook) + ', CHAPTER ' + str(iCurrChapter) + ':\n')  # very first line

# loop through all rows
for index, row in df.iterrows():
    if iCurrBook == row['b'] and iCurrChapter == row['c']:  # if we match both the current book and chapter
        file.write(row['t'] + '\n')  # just write the text
    else:  # else, it's a new beginning! a NEWWW BEGINNING!!!!
        if iCurrBook != row['b']:
            iCurrBook = iCurrBook + 1  # book increment
            iCurrChapter = 1  # and reset the chapter to 1
        elif iCurrChapter != row['c']:  # otherwise it is just a chapter increment
            iCurrChapter = iCurrChapter + 1
        file.write('\n')  # blank line to separate chapters
        file.write('BOOK ' + str(iCurrBook) + ', CHAPTER ' + str(iCurrChapter) + ':\n')
        file.write(row['t'] + '\n')  # don't drop the first verse of the new chapter

file.close()
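Before handing the file off to the RNN, it’s worth a quick sanity check that the headers and verses landed where expected. Here’s a minimal check (the “expected” lines below are what the KJV data should produce, not a captured run):

```python
# quick sanity check on the generated corpus file
with open('bible-text.txt') as f:
    for line in f.readlines()[:3]:
        print(repr(line))

# With the KJV data, I'd expect something like:
# 'BOOK 1, CHAPTER 1:\n'
# 'In the beginning God created the heaven and the earth.\n'
# 'And the earth was without form, and void; ...\n'
```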
Results
3 Layers, 512 Neurons each
So, what you’ve all been waiting for - what does our creative new bible look like? Here are some raw, totally unmodified samples from the Genesis corpus, using 3 hidden layers of 512 neurons each:
BOOK 1, CHAPTER 1:
And he said unto his father, the LORD return unto the hand of gold;
And the LORD made an healen with Rebekah Isaac; in the earth, and she conceived again, and bare Reuel, where the man; and he served thee, each men, meat for him contook unto you officer with him but amongst hundred yied in the camel: he is upon Pthand created.
And take to his ancead, and builded her mouth.
And Isaac inkedom.
This one starts off strong, then… kinda crashes :sad:.
BOOK 1, CHAPTER 31:
And he put thy name? And he said, What seed to his brethren knew that the cry white far from thence give it, and, behold, my father wept.
And he called it the land of Canaan his son.
But thou shalt be a firmament of her father's house, even every creeping twoth bring, for thou hast were three renoth, and took Loseph; and he smote the daughter Jacob.
And he ruled oxen?
And Reusled hath heard it.
And he set them, and fal, and Tessed sethith, and thy cift, and ingread him.
Did he? Did he rule oxen?
So, already the model has learned how to “spell” words, with most of them correct, aside from proper nouns like names and places. It has also picked up key features such as the LORD in all caps, the BOOK XXX, CHAPTER YYY: headers that I built into the original corpus, and phrases that probably appear very often in Genesis such as ‘unto them’, ‘and he said’, and a lot of sentences starting with ‘And’. However, note that to find these samples I had to generate a total of roughly 50,000 and hunt down the ones that made the most sense, so it was clear that I needed to train longer.
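As a rough way to back up the “learned to spell” observation (this isn’t part of the original workflow, just a quick idea), one could check what fraction of the generated words actually appear in the training corpus. A sketch, assuming the samples were saved to a hypothetical `sample.txt`:

```python
import re

def spell_rate(sample_path='sample.txt', corpus_path='bible-text.txt'):
    """Fraction of generated 'words' that occur somewhere in the training corpus."""
    tokenize = lambda text: re.findall(r"[A-Za-z']+", text.lower())
    with open(corpus_path) as f:
        vocab = set(tokenize(f.read()))
    with open(sample_path) as f:
        words = tokenize(f.read())
    hits = sum(w in vocab for w in words)
    return hits / max(len(words), 1)

# e.g. print(f"{spell_rate() * 100:.1f}% of generated words occur in the corpus")
```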
3 Layers, 1024 Neurons each
I beefed up the neural network with more neurons; here’s what I got:
Coming soon!!! (I know, total buzzkill 😂😂😂)
Conclusions and Notes
Some conclusions and notes as I move further into the neural network / deep learning space:
- A character-level RNN, while perhaps more powerful than a Markov model, still shows difficulty with grammar, even when the ‘loss’ is very low.
- A LOT of ML algorithms are open-sourced on GitHub - you just need to provide the data - so if you were of the mindset that NLP algorithms are heavily proprietary and locked away behind institution walls, well, I don’t think that is the case anymore.
- Even a simple model such as this one takes a very long time in CPU mode (remember - 4 i7’s running at full blast 🚀🚀🚀 took about 28 minutes to train on a 183 KB / 1,583-line corpus). From the documentation I’ve read, a GPU would be about 10x faster, so under 3 minutes - not so bad!
- Ugh, I want a GPU!
- I REALLY want a GPU!
Gimme Code!
As is tradition, here is the main code for this project (as always, the code is linked directly from the current source on GitHub - the GitHub repo itself can be found here). Remember, don’t forget to check out the original min-char-rnn by Karpathy, his improved, more optimized version on GitHub, the even further improved version by Justin Johnson on GitHub, and the Docker images by Cristian Baldi.
main.js
(also coming soon, lulz)