So I'm writing a script that lets people enter the URL of their writing profile on a particular site. The script then gives them a list of articles that don't meet their own quality specifications, along with some fun facts about their writing. Users can customize some pre-set specifications. For example, the word count specification, which defaults to 1000 words, can be set to 700 words, and the user then sees a list of their articles that are under 700 words. The script also shows users their average word count across their entire account.
I have a few problems with the script. One of the larger ones is that for users with large accounts (500+ articles), the load times are ridiculous. I would like to disguise the load time by showing information about the articles while the script is running. As it stands, nothing shows up until the script has finished running.
I would like the script to show, on the page, the current average (changing with each article), the number of articles the script still needs to comb through (or at least a % complete), and a list of the articles that don't meet a particular specification, with each one added as it fails.
Currently, my script takes ALL the URLs and opens them, puts the words into a hash of arrays, and does the word count calculations from there. Since much of the load time goes to opening URLs and pulling and cleaning text, my current design does not really allow for giving information during load time.
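A minimal sketch of one way around this, assuming each article can be fetched and counted on its own: process the URLs one at a time and hand back a progress snapshot after each. Names such as `each_article_progress`, `count_words`, and the `fetch` callable are made up for illustration, and the stand-in data replaces the real Nokogiri/open-uri calls so the sketch runs without a network.

```ruby
# Hypothetical helper: counts whitespace-separated words.
def count_words(text)
  text.split.size
end

# Processes one article at a time and yields a progress snapshot after each,
# instead of collecting every article before reporting anything.
# `fetch` is any callable that takes a url and returns [title, text]
# (an assumption; the real script would wrap its page-scraping code here).
def each_article_progress(urls, min_words: 1000, fetch:)
  total_words = 0
  short = []
  urls.each_with_index do |url, i|
    title, text = fetch.call(url)
    words = count_words(text)
    total_words += words
    short << title if words < min_words
    done = i + 1
    progress = {
      done: done,
      total: urls.size,
      percent: (100.0 * done / urls.size).round,
      average: total_words.to_f / done,
      short: short.dup
    }
    yield progress
  end
end

# Stand-in data so the sketch runs without touching the network.
fake_pages = {
  "url1" => ["Short hub", "one two three"],
  "url2" => ["Longer hub", "one two three four five"]
}
fetch = ->(url) { fake_pages.fetch(url) }

each_article_progress(fake_pages.keys, min_words: 4, fetch: fetch) do |p|
  puts "#{p[:done]}/#{p[:total]} (#{p[:percent]}%) average: #{p[:average].round(1)} words"
  puts "under the limit so far: #{p[:short].join(', ')}"
end
```

Each snapshot carries the running average, the percent complete, and the articles that have failed so far, which are the three things the page is meant to show. How the snapshots reach the browser is a separate question that this sketch does not cover.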
Here is the current script.
#http://www.codegurl.com/2012/02/disguising-load-times-w-information.html
require 'nokogiri'
require 'open-uri'
#URI.open comes from open-uri; on Ruby 3.0+ plain open no longer fetches urls

def create_base_url(username)
  #takes hub username and turns it into url using 'latest'
  "http://#{username}.hubpages.com/hubs/latest"
end

def get_index_pages(base_url, username)
  doc = Nokogiri::HTML(URI.open(base_url))
  range = doc.xpath('//span[@class="range"]').inner_text
  #finds number of hubs in 'range' string (the third space-separated piece)
  number_of_hubs = range.split(' ')[2].to_i
  #finds the number of index pages, 10 hubs per page, rounding up so a
  #partial last page is counted but an exact multiple of 10 adds no empty page
  number_of_index_pages = (number_of_hubs + 9) / 10
  #builds the index page urls from the last page down to page 1
  number_of_index_pages.downto(1).map do |page|
    "http://#{username}.hubpages.com/hubs/latest?page=#{page}"
  end
end

def get_hub_urls(index_list)
  hubs = []
  index_list.each do |index_url|
    doc = Nokogiri::HTML(URI.open(index_url))
    #collects the link to every hub listed on this index page
    doc.xpath('//div[@class="hub_pic"]/a').each do |e|
      hubs << e['href']
    end
  end
  return hubs
end

def pull_text(hub_urls)
  hubs = Hash.new
  hub_urls.each do |hub_url|
    doc = Nokogiri::HTML(URI.open(hub_url))
    main_text = doc.xpath('//div[@class="module moduleText color0"]').inner_text
    blue_text = doc.xpath('//div[@class="module moduleText color2"]').inner_text
    grey_text = doc.xpath('//div[@class="module moduleText color1"]').inner_text
    table_text = doc.xpath('//div[@class="module moduleTable color0"]').inner_text
    title = doc.search('title').inner_text
    all_text = main_text + blue_text + grey_text + table_text
    #maps each hub title to all of its text
    hubs[title] = all_text
  end
  return hubs
end

def clean_text(hubtxt_hash)
  #replaces each hub's text with an array of its words
  hubtxt_hash.each do |title, text|
    text = text.delete(",").gsub(" ", ",")
    #deleting "\n" joins the last word of a line to the first word of the next
    hubtxt_hash[title] = text.delete("\n").split(",")
  end
  return hubtxt_hash
end

puts "Enter HubPages username:"
username = gets.chomp
base_url = create_base_url(username)
index_pages = get_index_pages(base_url, username)
hub_urls = get_hub_urls(index_pages)
text = pull_text(hub_urls)
Another problem is the way in which I count words. It's way off. I've got to work on that, but I've already got a solution in mind. I just need to implement it.
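One possible approach, sketched under assumptions and not necessarily the solution in mind above: match words directly with a regular expression instead of deleting and splitting characters. The name `word_count` is made up for illustration.

```ruby
# Counts words by matching runs of letters and digits, allowing apostrophes
# inside a word (e.g. "It's"). Punctuation, repeated spaces, and newlines
# only ever separate words; they never merge two words or add empty ones.
def word_count(text)
  text.scan(/[[:alnum:]]+(?:'[[:alnum:]]+)*/).size
end

puts word_count("Hello, world!\nIt's a test.")  # prints 5
```

Hyphenated words such as "well-known" count as two words here, which may or may not be the behavior wanted.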
Edit: Unfortunately, at this time, HubPages is testing several different layout changes. Because of the way in which I wrote the code (picking out bits of CSS), I've decided to hold off on this project. When HubPages calms down with the design changes, I will continue with the project. See you then! (June 2010)