REBOL [ Title: "HTML Tools" Date: 18-Mar-2001 File: %html-tools.r Author: "Brett Handley" Email: brett@codeconscious.com Purpose: "Provide some functions to interpret/manipulate html." History: [ 1.0.0 [10-Oct-2000 { Original code } "Brett Handley"] 1.0.1 [9-Dec-2000 { simple-table-extract should handle TH now. Reorganised script and added comments to make it more readable. } "Brett Handley"] 1.0.2 [18-Mar-2001 { Added a more parse friendly dialect. Showed how the dialect can be parsed. Created a simple interpreter that generates the html from the dialect - seems much quicker than the original form-html. I'm having real troubles coming up with a decent naming convention for all these type of things. } "Brett Handley"] ] Comment: { This structure makes querying/changing html a bit easier. Requires the scripts tag-tool.r and highfun.r This structure makes no attempt at interpreting HTML structure it merely encodes it as a stream of more accessible elements than straight tags. Some of the functions below do simplistic interpretation of the HTML structure as represented by my encoding in order to achieve some specific tasks. The encoding uses blocks to represent tags, everything else is strings. Replacing tag! with block! types makes the tag attributes easily accessible. Comments are tags therefore they become blocks. But they are not further interpreted so you end up with a block with a single element of type string. While this encoding is simple, you may be suprised at how much you can achieve with it. A by-product of import-tag is that all attributes become quoted. So the following simple line will "automatically" fix all your unquoted html attributes. write %some-file.html form-html load-html %some-file.html One day I might add a simplistic pretty-up function to it. I've not stressed it, nor tested it with CGI. You are responsible for how you use this script not me! The same goes for any script this one relies on. ----- Need to add entity replace stuff. eg. & --> & ; Example of querying and manipulation of HTML pages. ; ; This example generates a new html page and lists all the ; href type links it can find in the home page of Rebol. ; test-block: [[html] [body] [p] "List of links" [/p] [/body] [/html]] ; Set some attributes - could have put them in blocks instead... set-attribute-value test-block/2 'bgcolor "gray" set-attribute-value test-block/3 'align "center" ; Find out where we want to insert new content into our new page. insert-here: find/only test-block [/p] ; Foreach href type tag we find in the home page add it to our new page. foreach link unique tags/attribute load-html http://www.rebol.com/ 'href [ insert-here: insert insert-here reduce [newline
link (select link 'href) ] ] ; Finally write out the new html page. write %tf.html form-html test-block } ] if not :use-script [use-script: :do] ; Use-script is my script manager use-script %highfun.r ; See http://www.rebol.org/advanced/highfun.r to download. use-script %tag-tool.r ; Available on my site. use-script %import-rebol-datatypes.r ; Available on my site. ; ; Helper functions - used for querying and manipulating the structure after it ; has been loaded. ; get-lookup-value: func[ block lookup-name ][ select block lookup-name ] set-lookup-value: func[ block attr-name attr-value ][ if not get-lookup-value block attr-name [ append block reduce [ attr-name none] ] head change/only next attr-pos: find block attr-name attr-value if none? attr-value [ remove/part attr-pos 2 ] block ] get-attribute-value: func[ "Get the value of a tag attribute." tag-block attr-name ][ get-lookup-value tag-block attr-name ] set-attribute-value: function [ "Set the value of a tag attribute." tag-block attr-name attr-value ][attr-pos][ set-lookup-value tag-block attr-name attr-value ] ; ; Calculated types (assumes results of import-tag) ; element-type: function [ "Returns the type of element." element-value ][][ either block? element-value [ either string? element-value/1 [ RETURN 'uninterpreted-tag ] [ either equal? first to-string element-value #"/" [ RETURN 'close-tag ][ either any [ equal? to-string last element-value "/" find [meta br img] first element-value ] [ RETURN 'unpaired-tag ][ RETURN 'open-tag ] ] ] ] [ RETURN 'char-data ] ] ; ; Encoding markup into my structure. ; As you can see it is pretty basic because it leverages ; the work of "import-tag". Tags are converted to blocks. ; load-html: function [ "Loads html into a simple dialect using blocks to represent tags." html ] [html-markup cursor] [ cursor: load/markup html while [not tail? cursor] [ if tag? cursor/1 [ change/only cursor import-tag first cursor ] cursor: next cursor ] head cursor ] ; ; "Decoding" my structure back to markup. ; form-element: function [ item-value ][result temp-value temp-pos item-type][ item-type: element-type item-value if equal? item-type 'char-data [ RETURN item-value ] if equal? item-type 'uninterpreted-tag [RETURN to-tag item-value/1] temp-value: build-tag item-value if equal? item-type 'unpaired-tag [ if find/match temp-pos: back back tail temp-value " /" [ remove temp-pos ] ] RETURN temp-value ] form-html: function [ "Forms html from the simple html block structure." html-block /add-lines "Add a line break after each element." ] [result cursor item] [ result: make block! length? html-block cursor: html-block while [not tail? cursor] [ insert tail result item: form-element cursor/1 if all [add-lines not tag? item] [append result "^/"] cursor: next cursor ] to-string result ] ; ; Searching functions ; tags: function [ "Returns a block of all tag blocks - by default excluding comments." html-block /name tag-name "Specific tag type" /attribute attr-name "Match attribute type" /value attr-value "Match attribute value - requires /attribute" /uninterpreted "Includes uninterpreted tags (comment tags)." /index "Return index as result." /position "Return block reference as result." ][x selected result-block][ if all [value not attribute] [ throw "/value requires /attribute for tags function." ] result-block: copy [] for i 1 length? html-block 1 [ x: pick html-block i selected: false if block? x [ either uninterpreted [ selected: true ][ if not string? x/1 [ selected: true ] ] ] if all [selected name] [ selected: (equal? x/1 tag-name) ] if all [selected attribute] [ selected: (select x attr-name) ] if all [selected value] [ selected: (find/match (select x attr-name) attr-value) ] if selected [ insert/only tail result-block either position [ result-block at html-block i ][ either index [i][x] ] ] ] RETURN result-block ] strings: function [ "Returns a block of all string elements." html-block /index "Return index as result." /position "Return block reference as result." /with-trim "Trims each string found." /no-empty "Exclude empty strings from result." ][x result-block][ result-block: copy [] for i 1 length? html-block 1 [ x: pick html-block i if not block? x [ if with-trim [ x: trim copy x ] if any [not no-empty not equal? x ""] [ insert/only tail result-block either position [ result-block at html-block i ][ either index [i][x] ] ] ] ] RETURN result-block ] find-tag: function [ "Finds the first occurrence of the named tag." html-block tag-name /tail "Finds the last occurrence instead." ][cursor][ cursor: html-block found: none while [all [not tail? cursor not found]][ if block? cursor/1 [ if not string? cursor/1 [ if equal? cursor/1/1 tag-name [ found: true ] ] ] if not found [cursor: next cursor] ] either tail? cursor [first reduce [none] ][either tail [next cursor][cursor]] ] ; ; Filtering/changing functions ; strip-lines: func[ "Returns the html-block but with line breaks removed from between tags." ; If you some of 'em back check out form-html/add-lines. html-block ][ filter (func[x][any [block? x equal? x "" not equal? trim copy x ""]]) html-block ] fill-empty-cells: function [ "Inserts an empty string into empty cells - simplifying later operations." html-block ][td-a-indexes td-b-indexes insertion-points][ td-a-indexes: map (func[x][add x 1]) ( union (tags/name/index html-block 'td) (tags/name/index html-block 'th) ) td-b-indexes: union tags/name/index html-block to-lit-word "/td" tags/name/index html-block to-lit-word "/th" insertion-points: head reverse sort intersect td-a-indexes td-b-indexes foreach p insertion-points [ insert at html-block p copy {} ] html-block ] simple-table-extract: function [ "Returns the string data table - nested tables not handled." html-block /number table-index "Specify table number to extract." ][table-start table-end table-block][ if not number [table-index: 1] ; Default is first table. ; Find the start of the table. table-start: html-block for i 1 table-index 1 [ table-start: find-tag/tail table-start 'table ] table-start: back table-start table-end: find-tag/tail table-start to-word "/table" table-block: copy/deep/part table-start table-end strings fill-empty-cells strip-lines table-block ] simple-table-extract2: function [ "Returns the string data table as an array - nested tables not handled." html-block /number table-index "Specify table number to extract." ][table-start table-end table-block row-start row-end next-row-start result-block][ if not number [table-index: 1] ; Default is first table. ; Find the start of the table. table-start: html-block for i 1 table-index 1 [ table-start: find-tag/tail table-start 'table ] table-start: back table-start table-end: find-tag/tail table-start to-word "/table" table-block: copy/deep/part table-start table-end table-block: fill-empty-cells strip-lines table-block result-block: copy [] ; Find the first row row-start: find-tag table-block 'tr if not next-row-start [next-row-start: tail table-block] ; Process each row while [not tail? row-start] [ ; Find the start of the next row next-row-start: find-tag next row-start 'tr if not next-row-start [next-row-start: tail table-block] ; Find the end of the current row row-end: find-tag row-start to-lit-word "/tr" if not row-end [row-end: tail table-block] if lesser? next-row-start row-end [ row-end: next-row-start ] ; Extract the row data append/only result-block strings copy/part row-start row-end ; Advance the cursor row-start: next-row-start ] RETURN result-block ] ; ; Encoding "my structure" to a dialect ; html-dialect: function [ "Describes html in a simple dialect." html-spec [ file! url! string! block!] ][result html-block][ either equal? type? html-spec block! [ html-block: html-spec ][ html-block: load-html html-spec ] result: copy [] foreach e html-block [ insert tail result element-type e either equal? element-type e 'close-tag [ insert/only tail result to-word next to-string first e ][ either find [unpaired-tag open-tag] element-type e [ insert/only tail result first e insert/only tail result next e ][ insert tail result e ] ] ] RETURN result ] html-dialect-parser: context [ _tagname: none _tagattributes: none _chardata: none _unpaired: none html-dialect-rules: [ some [ ['open-tag (_unpaired: false) | 'unpaired-tag (_unpaired: true)] set _tagname word! set _tagattributes block! | 'close-tag set _tagname word! (_tagname: to-word join "/" _tagname) | 'char-data set _chardata string! | 'uninterpreted-tag set _chardata string! ] ] set 'parse-html-dialect func [ data ][ parse data html-dialect-rules ] ] form-html-from-dialect-interpreter: context [ _output: none open-tag: func [ 'name [word!] attributes [block!] ][ insert tail _output build-tag head insert copy attributes name ] unpaired-tag: func [ 'name [word!] attributes [block!] ][ open-tag name attributes ] close-tag: func [ 'name [word!] ][ insert tail _output build-tag reduce [to-word join "/" name] ] uninterpreted-tag: func [ data [string!] ][ insert tail _output build-tag reduce [data] ] char-data: func [ data [string!] ][ insert tail _output data ] interpret: func [ dialect ][ if not parse-html-dialect dialect [ throw "Input doesn't conform to the dialect grammar." ] _output: copy "" do bind/copy dialect 'self RETURN _output ] set 'form-html-from-dialect :interpret ] ; ; Convenience functions ; html-file?: function [ "Determine if an html file, returns the extension or none." test-file [file!] ][file-ext][ either find [%.html %.htm] file-ext: get in import-file test-file 'extension [ file-ext ][ none ] ] html-files?: func ["Returns the html files in a specified dir-spec" dir-spec [file! url!]][ map func[x] [either equal? dir-spec %./ [x][join dir-spec x]] filter :html-file? read dir-spec: dirize dir-spec ]