GOCR for Windows From Linux

My last post was a major screw up. I admit it. Maybe I was high, but whatever. I posted some shit and someone called me out on it. Ah well, big deal. I post some good stuff, I reckon. Totally for free. If you can’t read through some of my bullshit to get to it then I will clean up the proper info for you for a small fee of $600. But I guess you’re all just going to have to read my shit along with the good stuff. Am I retarded? Hell yeah. But I don’t give a fuck.

Anyway, instead of rebooting your PC every time you want to compile a program for Windows, it’s actually possible to compile straight from Linux with wine. This includes ImageMagick programs. Now I’m not sure, but I think some other folks have tutorials on similar stuff. It’s a pain to get ImageMagick included as well, though. Here’s how you do it.

Go download Dev-Cpp. We’ll use this as our IDE.

Now go get the latest MinGW. You’ll know you’ve got the right one because in Dev-Cpp/bin/ you’ll find a load of files with the same names as in MinGW/bin/.

Copy the files from MinGW/bin/ over the Dev-Cpp/bin/ directory. This is basically an update of MinGW. Oh yeah, copy over the libraries (lib/) too. I’m pretty sure you need those. Something about gettimeofday() not being present in older versions of MinGW.

Download the ImageMagick for MinGW install. It’s listed under the Unix binaries but runs on Windows. No idea why. Anyway, this will install a directory with some libraries in it. Libraries are files that have a .a extension, like libMagick.a

Ok, right, if you’ve done all that you can go ahead and run Dev-Cpp. Like this (make sure you’ve installed it using wine):

user[~]$ cd .wine/drive_c/Dev-Cpp/

user[Dev-Cpp]$ wine devcpp.exe

Then make a command-line C++ project, set the include directory to C:\ImageMagick…\include\ and add these three libraries in this order:

C:/ImageMagick-6.3.7/lib/libMagick++.a
C:/ImageMagick-6.3.7/lib/libWand.a
C:/ImageMagick-6.3.7/lib/libMagick.a

[Screenshots: Change Include Directory, Add Libraries]

That will now compile perfectly, but when it comes to linking it will complain that there are a billion functions missing. That’s because they didn’t include all the other libraries you need. Which is what this tutorial is really about. Basically I went in search of them all, and now I’ve got them I’ll zip them up for everyone. The other files you need are (you can probably just paste this into your project properties):

C:/Dev-Cpp/otherlib/libfreetype.a
C:/Dev-Cpp/otherlib/libjbig.a
C:/Dev-Cpp/otherlib/libjpeg.a
C:/Dev-Cpp/otherlib/liblcms.dll.a
C:/Dev-Cpp/otherlib/libpng.a
C:/Dev-Cpp/otherlib/libtiff.a
C:/Dev-Cpp/otherlib/libtiffxx.a
C:/Dev-Cpp/otherlib/libz.a
C:/Dev-Cpp/lib/libgdi32.a

Incidentally Dev-Cpp does come with libgdi32.a, which is handy. :D
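The order of those libraries matters because the GNU linker that MinGW uses resolves static libraries in a single pass, so a library has to come after whatever refers to it. Here is a rough sketch of that ordering as a g++ argument list. The helper name `link_args` and the include directory placeholder are made up for illustration, and it only builds the list, it doesn’t run the compiler.

```python
# Hypothetical helper: builds a g++ argument list, it does not invoke g++.
# Library paths are the ones listed in the post, in the same order.
IMAGEMAGICK_LIBS = [
    "C:/ImageMagick-6.3.7/lib/libMagick++.a",
    "C:/ImageMagick-6.3.7/lib/libWand.a",
    "C:/ImageMagick-6.3.7/lib/libMagick.a",
]
OTHER_LIBS = [
    "C:/Dev-Cpp/otherlib/libfreetype.a",
    "C:/Dev-Cpp/otherlib/libjbig.a",
    "C:/Dev-Cpp/otherlib/libjpeg.a",
    "C:/Dev-Cpp/otherlib/liblcms.dll.a",
    "C:/Dev-Cpp/otherlib/libpng.a",
    "C:/Dev-Cpp/otherlib/libtiff.a",
    "C:/Dev-Cpp/otherlib/libtiffxx.a",
    "C:/Dev-Cpp/otherlib/libz.a",
    "C:/Dev-Cpp/lib/libgdi32.a",
]


def link_args(sources, output, include_dir):
    """Return g++ arguments: sources first, then libraries in dependency order."""
    return (
        list(sources)
        + ["-I" + include_dir, "-o", output]
        + IMAGEMAGICK_LIBS   # the C++ wrapper first, then the C libraries it calls
        + OTHER_LIBS         # the image format libraries ImageMagick itself needs
    )


if __name__ == "__main__":
    # The include directory is a placeholder; use wherever your install put it.
    print(" ".join(link_args(["main.cpp"], "main.exe", "<ImageMagick include dir>")))
```

If the linker still moans about missing symbols, the usual suspect is a library sitting before the thing that needs it.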

Here are the libraries. Enjoy.

Other Libraries You Need

All my programs so far in C++ will compile to a Windows .exe from Linux using this method. I find it handy. Especially as Windows refuses to install because Linux is on my first hard disk.

August 4th, 2008, posted by Harry

The most fucktarded content rewriting

So what’s a good idea when you scrape someone’s content? Link to them? Yeah, that’d be a good start.

http://seosandbox.com/2008/07/25/user-contributed-captcha-breaking-w-phpbb2-example/

Type this twat’s URL in and you’ll have flashbacks to March 2008 and a post on bluehatseo.com except someone swallowed a very small thesaurus. Bluehatseo.com got ddos’d for posting that article. I think this guy should get a dose of the same.

On the plus side he didn’t use markov’d content, he simply replaces certain words with others. He doesn’t reorder the sentences. What we need is a site with a captcha that says “write this sentence to mean the same but in a different way”. Put some porn on the site, take the answers and use them in our content rewriter. Filter for dodgy words etc etc.

Anyway, I was gonna make a post about how to compile Windows programs from Linux as I’m having success with it now, but this guy made me hit the trigger button.

July 30th, 2008, posted by Harry

Money.co.uk Keeping up?

You all been keeping up with money.co.uk? They have had a couple of bits of linkbait out there that I’ve seen. Maybe you’ve seen more. I wonder if it worked?

Money search

Click the picture to see a search on google.co.uk for money. They’re not first. I think they’re like 5th. But who knows what else they have to turn loose upon the search engines?

July 25th, 2008, posted by Harry

PHPBB3 Code

Windows is a total pain. I spent ages trying to get this phpbb3 code to compile on it. The code is messy as anything. It was written pretty fast just to do the job. I might release piratebay, which has much cleaner code in it, if people are interested. I actually used separate modules :D .

The code just runs through the entire phpbb3 captcha and fills in areas until it finds a small area. It assumes that any small area must be part of the letter. It then blurs all the little squares together, spaces the letters out better, rotates them and dumps a file that can be read by gocr.
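To give a rough idea of the fill step, here is a minimal sketch in Python. It is not the released code. It assumes the captcha is already a grid of integer colour values, and the function name `small_regions` and the `max_size` cut-off are invented for the example. It flood-fills every connected area of the same colour and keeps only the small ones, on the theory that small areas are bits of a letter.

```python
from collections import deque


def small_regions(grid, max_size):
    """Flood-fill each same-coloured area; return the areas with at most max_size cells."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    kept = []
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            colour = grid[y][x]
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                # Visit the four direct neighbours that share this colour.
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and grid[ny][nx] == colour):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) <= max_size:
                kept.append(region)
    return kept


if __name__ == "__main__":
    captcha = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 2],
    ]
    for region in small_regions(captcha, max_size=4):
        print(sorted(region))
```

The big background area gets filled once and thrown away; what is left is the list of little squares to blur together.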

The Windows executable barely compiled. Hopefully it works, but since I don’t use Windows I have no idea really :D .

Here it is. PhpBB3 Crack tool

July 21st, 2008, posted by Harry

Would you like to be showered with quality links?

If you would… Here’s the plan.

Make some ridiculously shit web page that does nothing except look cool. Maybe you answer a test and get some shit answer out. Make sure that whatever it is, people want to include some kind of pointless widget from it on their blog. Whilst you’re there, squeeze a sneaky link back to your real site into it.

Then you’ll be wanting to buy or bully your way into a review on a top site. Let’s say… John Chow… That sounds cool.

Oh no wait. It’s just been done. 

July 15th, 2008, posted by Harry

Utilizing ANNs More Efficiently

It has come to my attention that GOCR has its shortcomings. :D The problem is that small adjustments in pixels from surrounding noise cause it to recognize h’s as b’s and little things like that. Most of these OCR packages were never designed to learn new alphabets. If you open the source code to OCRAD up you’ll see that it breaks the letters down into a list of features which are then used to assign a probability as to the most likely letter. None of it is trained. It is all pre-planned and hard coded into the program. I’m not 100% sure how GOCR training works but I think it happens at the pixel level.

Now a while back I did a post on training a neural network at the pixel level to recognize characters. It took a long time because it was PHP (please use the C++ libraries for training unless you have a lifetime to train the neurons) but it started to work, although not as accurately as GOCR. The problems were obvious things like h’s getting picked up as b’s again. You can see why the neurons failed to recognize the character.

The nice thing about neural networks is they’re pretty simple to use once you get to grips with the number of layers you might want and so on. The other nice thing is that we can stack them together in a similar way to how a full adder works. I.e. you can pass a carry flag from one adder to the next.

So the reason our neural nets are failing is that the pixels differ a little here and there, and without some knowledge about exactly how a letter is formed it’s difficult to know which letter it is. So if we take a step back from the pixel level there are a couple of things we can analyze. We can look for hills & valleys. Like an ‘n’ has a space inside it. If we calculate the baseline of the text we can also identify whether the word has lines that dip below the baseline or go very high above it. Just using these features we can train a neural net to guess at a range of characters. Then we can feed the output into our second neural net, which works at pixel level or concatenates the output of a couple of other nets.
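Here’s a minimal sketch of that kind of feature extraction in Python. Everything in it is an assumption for illustration: glyphs are lists of strings with ‘#’ for ink, the x-height and baseline rows are handed in rather than calculated, and the “valleys” are approximated by counting separate runs of ink on those two rows. An ‘h’ stands on two legs at the baseline while a ‘b’ is closed there, which is exactly the pair the pixel-level net keeps mixing up.

```python
def ink_runs(row):
    """Count the separate horizontal runs of ink ('#') in one row of a glyph."""
    runs, inside = 0, False
    for cell in row:
        if cell == "#" and not inside:
            runs += 1
        inside = cell == "#"
    return runs


def glyph_features(rows, xheight_row, baseline_row):
    """Return [ascender, descender, runs at x-height, runs at baseline].

    xheight_row and baseline_row are row indices; working them out from a
    whole line of text is assumed to happen somewhere else.
    """
    has_ascender = any("#" in row for row in rows[:xheight_row])
    has_descender = any("#" in row for row in rows[baseline_row + 1:])
    return [
        int(has_ascender),
        int(has_descender),
        ink_runs(rows[xheight_row]),
        ink_runs(rows[baseline_row]),
    ]


LETTER_H = ["#...", "#...", "###.", "#..#", "#..#", "...."]
LETTER_B = ["#...", "#...", "###.", "#..#", "####", "...."]

if __name__ == "__main__":
    print(glyph_features(LETTER_H, xheight_row=2, baseline_row=4))  # [1, 0, 1, 2]
    print(glyph_features(LETTER_B, xheight_row=2, baseline_row=4))  # [1, 0, 1, 1]
```

The two letters differ only in the last number, and a bit of pixel noise in the middle of the glyph doesn’t change it.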

The idea behind this is that feature extraction is a proven technique that gets very good results until the characters deviate from the norm. Using multiple nets we should be able to combine the ability to train a new alphabet with the power of feature extraction.
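A rough sketch of how the stacking might be wired up, in plain Python. The weights are random placeholders, not trained, and the layer sizes (4 features, 3 character groups, 16 pixels, 26 letters) are made up purely to show the data flow: the first net’s output gets tacked onto the pixel input of the second.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer with a sigmoid activation."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real net would learn these."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def stacked_forward(features, pixels, feature_net, pixel_net):
    """Run the feature net, then feed its output plus the raw pixels to the second net."""
    group_scores = dense(features, *feature_net)
    return dense(list(pixels) + group_scores, *pixel_net)


rng = random.Random(0)
feature_net = random_layer(4, 3, rng)       # 4 features -> 3 character-group scores
pixel_net = random_layer(16 + 3, 26, rng)   # 16 pixels + 3 group scores -> 26 letter scores

if __name__ == "__main__":
    # A hypothetical four-number feature vector and a blank 4x4 pixel grid.
    scores = stacked_forward([1, 0, 1, 2], [0.0] * 16, feature_net, pixel_net)
    print(len(scores))
```

In a real setup you’d train the feature net first, then train the second net with the first one’s outputs as extra inputs.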

July 11th, 2008, posted by Harry

Record search information

Wouldn’t it be cool if you knew all the searches your visitors typed into Google, even though it’s not on your domain? With Internet Explorer 7, now you can! I haven’t checked this vulnerability out properly so I’m not quite sure of the details, but it looks pretty severe.

I think you may need to have your page still open for this to work, so navigating away *may* destroy the recording, but if they click open a new window then it should work I think. The scammers will be having a field day anyway.

Click Here unless you use Internet Explorer in which case…

Click here for Disney and ignorance

July 3rd, 2008, posted by Harry

Anti-Cookie Stuffing

I feel like writing right now. Weird. But anyway. Cookie stuffing works. Cookie stuffing is on the limits of even my ethics :D . Cookie stuffing should be solved. Why don’t the big companies seem to care? Maybe they make too much to even notice, they just factor it in as an inevitable loss with affiliate marketing. On to solving. Cookie stuffing can generally be done two or three ways.

hidden IFRAME

cross-domain browser bug

an image pointing to site with affiliate URL

If there is a cross-domain javascript browser bug then you are fecked. Nothing you can really do to solve that.

Ok, so the easy one. An image is the easiest way to cookie stuff; you can do it on any forum, blog, etc. It’s easy. Instant traffic with no work. But it’s a CSRF (cross-site request forgery)! Stopping it simply needs a two-step process on the affiliate merchant’s website. The browser loads the affiliate URL, but then on that page is another HTTP request to a tokenized URL based on the user’s session cookie. Remember, as the stuffer we can’t read a cookie and we can’t place a cookie because it’s not on our domain. All we can do is basic HTTP requests. So if the tokenized URL is never loaded then that must mean the page has never been parsed by a browser, so the merchant doesn’t give the user a cookie. Simple.
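The two-step check could look something like this on the merchant’s side. It’s a sketch under assumptions, not anybody’s real system: the class `AffiliateTracker`, the `/confirm` URL and the secret key are all invented, and the token is simply an HMAC of the session id. Hitting the affiliate URL only parks the click as pending; the affiliate is credited when the tokenized URL embedded in the page gets requested, which an image tag on a forum will never do.

```python
import hashlib
import hmac

SECRET = b"merchant-secret"  # hypothetical server-side key


def make_token(session_id):
    """Token tied to the visitor's session; only the merchant can compute it."""
    return hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()


class AffiliateTracker:
    def __init__(self):
        self.pending = {}   # session_id -> affiliate waiting for confirmation
        self.credited = {}  # session_id -> affiliate that earned the cookie

    def landing(self, session_id, affiliate_id):
        """Step 1: affiliate URL requested. Credit nothing yet; return the
        tokenized URL that the landing page embeds."""
        self.pending[session_id] = affiliate_id
        return "/confirm?token=" + make_token(session_id)

    def confirm(self, session_id, token):
        """Step 2: tokenized URL requested, so a browser really parsed the page."""
        expected = make_token(session_id)
        if session_id in self.pending and hmac.compare_digest(token, expected):
            self.credited[session_id] = self.pending.pop(session_id)
            return True
        return False


if __name__ == "__main__":
    tracker = AffiliateTracker()
    url = tracker.landing("session-1", "affiliate-42")
    print(tracker.credited)   # still empty: nothing parsed the page yet
    print(tracker.confirm("session-1", url.split("token=")[1]))
```

A stuffed image request stops after step 1, so it sits in `pending` forever and never pays out.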

Ok, now before I started writing this I thought the IFRAME method could never be algorithmically detected. You’ve got the obvious checks, such as taking the referrer URL and sending a bot to make sure the page isn’t breaking rules, but that’s a laborious process. However, if I remember correctly, browser security rules let a page find out from javascript that it is sitting inside a frame (its window is not the top window), even though they stop it reading much else about a parent on another domain. So if javascript is enabled then it should be easy to check that the page has not been IFRAMED. If it has been IFRAMED that’s a big red flag, and I think javascript can also test the IFRAME to make sure it conforms to the rules right there and then.

Cookie stuffing solved. Anybody going to do anything about it?

June 29th, 2008, posted by Harry

Computer Recognized Photographs

I found this on my wild and wacky adventures on the Internet, and it’s pretty amazing.

http://wang.ist.psu.edu/cgi-bin/zwang/alip_result1.cgi?test=1

It’s a computer program categorizing images based on statistical probabilities. Now where’s the download source code button? Darn those people!

June 28th, 2008, posted by Harry

Indian Digging

I’m pretty busy working on some code, hence the minimal posting. But I was thinking last night about getting a power user Digg account without actually having to go and interact with the community, because that is horrendously boring. Indians are pretty cheap, right ;) . I wonder how much it would cost to get them to build you a power user account on a social bookmarking site. Just lay out a plan to follow every day listing how many hours to spend on each task and then pay them by the hour.

Anybody done this? Is it economically sound? You’ll be paying for an asset (or a liability depending on if it works :D ), the account, that you can use more than once unlike when you purchase diggs.

June 24th, 2008, posted by Harry