Cragady/cnility

Overview

I took on this project of extracting HTML and text from a PDF as an exercise of curiosity.

There are likely better ways to handle what I'm doing in this repo, but I like poking around things like this.

One thing to note: DO NOT PRINT WOFF FILES TO THE CONSOLE AS THEY STAND. For some reason, it seems to change the locale or something.

Special characters are printed to the console that will ruin the output. I don't know how to fix this at the moment.

It's sort of like how you can change the terminal colors by printing special characters, and you have to print a reset char to set the color back to normal. It's like that, but with characters instead. It's quite something to look at.

The following is an output of the ls command in this repo after printing the WOFF files.

51Q┤▒┌␋├≤S▒└⎻┌␊.⎻┼±   README.└␍   R␊⎻┌␋␌▒C⎺┴␊⎼F␋┼▒┌.⎻┼±   ␉␋┼␍␋┼±.±≤⎻   ␉┤␋┌␍   ␌⎺└└⎺┼.±≤⎻␋   ␌⎺└⎻␋┌␊ ␌⎺└└▒┼␍⎽.┘⎽⎺┼   ␌⎺┼┴␊⎼⎽␋⎺┼.⎽␤  '┐&⎼ ┴0⎼▒±├┤│8.⎻␍°'   ┼⎺␍␊ └⎺␍┤┌␊⎽   ┼⎺␍␊└⎺┼.┘⎽⎺┼   ⎻▒␌┐▒±␊-┌⎺␌┐.┘⎽⎺┼   ⎻▒␌┐▒±␊.┘⎽⎺┼   ⎻␍°2␤├└┌EX   ⎽⎼␌   ├⎽␌⎺┼°␋±.┘⎽⎺┼   ┴␊┼┴

Prolly not a huge deal if you do print them, but just thought I'd let you know. A cheeky reset seems to fix the problem.

Maybe this is my lack of experience with working with fonts on a low level leaking through. Maybe this is obvious to more seasoned programmers: don't print font files maybe?
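If you do want to peek at a font file from the terminal, one option is to escape anything that isn't printable ASCII before it reaches the console. The garbling above looks like the terminal being switched to an alternate character set by control bytes in the binary data, but that is my guess and not something verified here. The sketch below is not code from this repo, and `escapeForTerminal` is a made-up name for illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of this repo): returns a copy of `raw`
// that is safe to print. Printable ASCII (0x20..0x7E) is kept as-is;
// every other byte, including control bytes such as ESC (0x1B) and
// SO (0x0E), is written out as a \xNN escape.
std::string escapeForTerminal(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (unsigned char byte : raw) {
        if (byte >= 0x20 && byte <= 0x7E) {
            out.push_back(static_cast<char>(byte));
        } else {
            char buf[5];  // backslash, 'x', two hex digits, terminating NUL
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(byte));
            out += buf;
        }
    }
    return out;
}
```

Read the file into a `std::string` in binary mode, pass it through this, and print the result. The output is ugly, but it can't change the terminal's state.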

What I do know about these fonts, though, is that they transform three-byte UTF-8 sequences. This is the reason FileConversion.cpp exists in the first place: to read each three-byte UTF-8 sequence and transform it back into a normal UTF-8 sequence. The next step for this repo is to read the WOFF files and apply similar transforms there. There may be a need for another font, or it could be as simple as changing the font size on certain characters. I'm unsure, but the WOFF files have a different set of glyphs for A-Z for some reason. Likely just for sizing.
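To give a rough idea of what that transform looks like, here is a minimal sketch. It is not the actual contents of FileConversion.cpp. It assumes each garbled character is a code point that encodes as three UTF-8 bytes and that you already have a lookup table from those code points to the intended characters. `GlyphMap`, `remapThreeByteSequences`, and the table contents are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical lookup table: code point found in the HTML -> intended char.
using GlyphMap = std::unordered_map<std::uint32_t, char>;

// Walks `in` byte by byte. A well-formed three-byte UTF-8 sequence
// (1110xxxx 10xxxxxx 10xxxxxx) is decoded to its code point; if the code
// point is in `map`, the mapped char is emitted, otherwise the three bytes
// are copied through unchanged. Every other byte is copied as-is.
std::string remapThreeByteSequences(const std::string& in, const GlyphMap& map) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if ((b0 & 0xF0) == 0xE0 && i + 2 < in.size()) {
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(in[i + 2]);
            if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
                const std::uint32_t codePoint =
                    (static_cast<std::uint32_t>(b0 & 0x0F) << 12) |
                    (static_cast<std::uint32_t>(b1 & 0x3F) << 6) |
                    static_cast<std::uint32_t>(b2 & 0x3F);
                const auto hit = map.find(codePoint);
                if (hit != map.end()) {
                    out.push_back(hit->second);  // replace with intended char
                } else {
                    out.append(in, i, 3);        // unknown: keep original bytes
                }
                i += 3;
                continue;
            }
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}
```

For example, with a table that maps U+E041 to 'A', the bytes EE 81 81 in the input come out as a plain A, while anything not in the table (say, a real euro sign) passes through untouched.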

Usage

You will need to run this on a system where you can #include <zlib.h>. For me, this was Ubuntu under WSL.

  • npm run dev-c # this will compile the C++ and run tsc & nodemon
  • Go to localhost:3000
  • Click on the Convert Kandr Chars
  • Visit localhost:3000/kandr/parsed
  • OR: Look in ./src/kandr/parsed to see parsed HTML files

You can verify parsing went correctly by looking in the source of localhost:3000/kandr/parsed and seeing readable text in the HTML, unlike localhost:3000/kandr which will be a garbled mess most of the time.

There is no need to run pdf2htmlEX since I've already taken care of that step. conversion.sh won't help you unless you have the pdf2htmlEX binary in this dir to run.

Dependencies

Right now, these are notable dependencies due to them needing to be excluded from git tracking or vendoring into the project.

Just make sure zlib (including its development headers) is on your system somehow, e.g. on Ubuntu: sudo apt-get install zlib1g-dev if needed.

For pdf2htmlEX grab the appropriate binary from their releases page and put it in the project's dir following .gitignore's nomenclature.

Or put it wherever. The world is your oyster.

Conversion Completed

As of right now, the conversion is complete. I still need to go through and see if layouts and sizing need fixing.

// TODO: verify layouts and sizings - see if needs fixing
// TODO: verify ligatures handled properly
// TODO: look for patterns in HTML and parse the data out to a digestible format
// TODO: parse previously mentioned digestible data into HTML
// TODO: address inherited issues
// TODO: (if wanted) touch up GUI for basic controls

Instead of parsing the fonts with C++, I've decided to just use FontForge to fix the fonts and add ligatures where necessary. I may make a separate project where I read WOFF, OTF, and/or TTF files with C++ using some of the code I wrote here. I likely won't let JS or TS touch that project.

POST RAGTUX

Unfortunately, we don't have the LaTeX source for this PDF, but this project is meant to alleviate the lack of a LaTeX source by parsing out the PDF to readable HTML.

There is likely a text source somewhere else out there, but as I've stated previously, this is a project that is an exercise of curiosity. I may attempt to read the PDF directly as a nightmare mode challenge if I'm feeling super up to it.

Anything past this point in the README was written by ragtux.

K&R 2E

Welcome to the unauthorized K&R 2E repository! K&R is an amazing book both in terms of its lasting historical impact (it's 50 years old!) and in the timeless quality of its technical writing. It is truly a computer science classic.

The impetus for this project was my frustration with the seemingly non-existent good quality (typeset, non-scanned) pdfs available. I had purchased both the first and the second edition of K&R and was looking to buy a good quality (typeset) pdf for digital reference but couldn't find one.

The typesetting is entirely LaTeX. The advantage of LaTeX is that, if organized well enough, one can easily "rice" a document. Most of the graphics were done in Inkscape. There are plenty of drawing packages in LaTeX (e.g. TikZ) but I am largely unfamiliar with them. I would love to learn more about LaTeX and include them if anyone can provide good looking working examples.

[Images: quality sample, sample cover]

I did this originally using XeLaTeX, which allows the use of system fonts. Since some of the fonts used are non-free, I am currently working on converting everything to a free font so that everyone can work on it.

Project Goals

There are two objectives:

  • A "vanilla" version that is as true to the original K&R 2E (stock size, fonts, layout, coloring, etc.) as reasonably possible.

  • A "super deluxe" version inspired by what the guys here did with SICP. I really enjoy the font combo and the syntax highlighting. I would also like this deluxe version to have all its exercises hyperlinked to the answers which will be at the back of the book. The answers will come from the second edition of Clovis L. Tondo's official answer book.

TODO

  • Complete Appendix A, B, and C
  • Organize source files and push to repository.
  • Thorough layout error checking. These are hopefully little things, like something wasn't italicized when it should have been, or something was indented inconsistently.
  • Transcription and testing of Clovis L. Tondo's official answer book (2e) for use in the deluxe version.

Legal

I do not own this work, nor do I ask for money. Please support the publisher and authors by purchasing an official copy.

That being said, this work is nearing 50 years old, and C has had its time in the sun. A quick search on Google or GitHub will reveal many other bootleg copies of the book. So this project shouldn't stand out in that sense.
