Class Mechanize
In: lib/mechanize.rb
Parent: Object

The Mechanize library automates interaction with websites. It can follow links, populate form fields and submit forms. A history of visited URLs is maintained and can be queried.

Example

  require 'mechanize'
  require 'logger'

  agent = Mechanize.new
  agent.log = Logger.new "mech.log"
  agent.user_agent_alias = 'Mac Safari'

  page = agent.get "http://www.google.com/"
  search_form = page.form_with :name => "f"
  search_form.field_with(:name => "q").value = "Hello"

  search_results = agent.submit search_form
  puts search_results.body

Issues with mechanize

If you think you have found a bug in mechanize but aren't sure, please file a ticket at github.com/tenderlove/mechanize/issues

Here are some common problems you may experience with mechanize:

Problems connecting to SSL sites

Mechanize defaults to validating SSL certificates using the default CA certificates for your platform. At this time, Windows users do not have integration between the OS default CA certificates and OpenSSL. cert_store explains how to download and use Mozilla's CA certificates to allow SSL sites to work.

Problems with content-length

Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length, because it cannot tell whether the mismatch comes from a connection problem or a server bug.

The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:

  agent = Mechanize.new
  uri = URI 'http://example/invalid_content_length'

  begin
    page = agent.get uri
  rescue Mechanize::ResponseReadError => e
    page = e.force_parse
  end

Classes and Modules

Class Mechanize::Error

Constants

VERSION = '2.5.1'   The version of Mechanize you are using.
  AGENT_ALIASES = {
    'Mechanize' => "Mechanize/#{VERSION} Ruby/#{ruby_version} (http://github.com/tenderlove/mechanize/)",
    'Linux Firefox' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.1) Gecko/20100122 firefox/3.6.1',
    'Linux Konqueror' => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
    'Linux Mozilla' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
    'Mac FireFox' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6',
    'Mac Mozilla' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4a) Gecko/20030401',
    'Mac Safari 4' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; de-at) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10',
    'Mac Safari' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22',
    'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Windows IE 7' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Windows IE 8' => 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Windows IE 9' => 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Windows Mozilla' => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
    'iPhone' => 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C28 Safari/419.3',
  }

Supported User-Agent aliases for use with user_agent_alias=. The description in parentheses is for informative purposes and is not part of the alias name.
  • Linux Firefox (3.6.1)
  • Linux Konqueror (3)
  • Linux Mozilla
  • Mac FireFox (3.6)
  • Mac Mozilla
  • Mac Safari (5)
  • Mac Safari 4
  • Mechanize (default)
  • Windows IE 6
  • Windows IE 7
  • Windows IE 8
  • Windows IE 9
  • Windows Mozilla
  • iPhone (3.0)

Example:

  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'

Public Class methods

Creates a new mechanize instance. If a block is given, the created instance is yielded to the block for setting up pre-connection state such as SSL parameters or proxies:

  agent = Mechanize.new do |a|
    a.proxy_host = 'proxy.example'
    a.proxy_port = 8080
  end

History

Methods for navigating and controlling history

Public Instance methods

Equivalent to the browser back button. Returns the previous page visited.

Returns the latest page loaded by Mechanize

The history of this mechanize run

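Maximum number of items allowed in the history. The default setting is 50 pages. Note that the size of the history multiplied by the maximum response body size is an approximate upper limit on the amount of memory mechanize will use for stored response bodies.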

Sets the maximum number of items allowed in the history to length.

Setting the maximum history length to nil will make the history size unlimited. Take care when doing this: mechanize stores response bodies in memory for pages and in the temporary files directory for other responses. For a long-running mechanize program the total can become quite large.

See also the discussion under max_file_buffer=
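
The effect of a history limit can be pictured with a small stand-alone sketch. This is not mechanize's own history class: BoundedHistory and its methods are hypothetical names, and the sketch assumes that the oldest entry is dropped first once the limit is exceeded.

```ruby
# Hypothetical stand-in for a page history with an optional size limit.
class BoundedHistory
  attr_reader :max_size, :entries

  def initialize(max_size = 50)
    @max_size = max_size # nil means unlimited
    @entries = []
  end

  # Append a page, then drop the oldest entries beyond the limit.
  def push(page)
    @entries << page
    @entries.shift while @max_size && @entries.size > @max_size
    self
  end
end

history = BoundedHistory.new(2)
%w[a b c].each { |url| history.push(url) }
history.entries # the oldest entry, 'a', has been evicted

unlimited = BoundedHistory.new(nil)
1000.times { |i| unlimited.push(i) } # nothing is ever evicted
```

With a nil limit nothing is evicted, which is why an unlimited history grows for as long as the program runs.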

page()

Alias for current_page

Returns a visited page for the url passed in, otherwise nil

visited_page(url)

Alias for visited?

Hooks

Hooks into the operation of mechanize

Attributes

history_added  [RW]  Callback which is invoked with the page that was added to history.

Public Instance methods

A list of hooks to call before reading response header ‘content-encoding’.

The hook is called with the agent making the request, the URI of the request, the response, and an IO containing the response body.

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

A list of hooks to call before retrieving a response. Hooks are called with the agent and the request to be performed.
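
As an illustration of the calling convention for hooks that run after a response is retrieved, the sketch below registers a lambda and invokes it by hand. The hook list is a plain Array, and the agent and response are stand-ins introduced only for this example; with a real agent the same lambda would be appended to the agent's own hook list.

```ruby
require 'uri'

# Stand-in for the agent's list of post-connect hooks (hypothetical).
post_connect_hooks = []
seen = []

# A hook is any callable taking the agent, the URI, the response and the body.
log_hook = lambda do |_agent, uri, _response, body|
  seen << "#{uri.host}: #{body.bytesize} bytes"
end
post_connect_hooks << log_hook

# Simulate what happens once a response has arrived: each hook runs in order.
fake_agent = Object.new
fake_response = { 'content-type' => 'text/html' }
post_connect_hooks.each do |hook|
  hook.call(fake_agent, URI('http://example.com/'), fake_response, '<html></html>')
end
```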

Requests

Methods for making HTTP requests

Public Instance methods

If the parameter is a string, finds the button or link with the value of the string on the current page and clicks it. Otherwise, clicks the Mechanize::Page::Link object passed in. Returns the page fetched.

DELETE uri with query_params, setting headers:

  delete('http://example/', {'q' => 'foo'}, {})

GETs uri and writes it to io_or_filename without recording the request in the history. If io_or_filename does not respond to write it will be used as a file name. parameters, referer and headers are used as in get.

By default, if the Content-type of the response matches a Mechanize::File or Mechanize::Page parser, the response body will be loaded into memory before being saved. See pluggable_parser for details on changing this default.

For alternate ways of downloading files see Mechanize::FileSaver and Mechanize::DirectorySaver.

GET the uri with the given request parameters, referer and headers.

The referer may be a URI or a page.

GET url and return only its contents

HEAD uri with query_params and headers:

  head('http://example/', {'q' => 'foo'}, {})

POST to the given uri with the given query. The query is specified by either a string, or a list of key-value pairs represented by a hash or an array of arrays.

Examples:

  agent.post 'http://example.com/', "foo" => "bar"

  agent.post 'http://example.com/', [%w[foo bar]]

  agent.post('http://example.com/', "<message>hello</message>",
             'Content-Type' => 'application/xml')
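
The hash and array-of-arrays forms carry the same key-value pairs. Assuming they are sent as a standard URL-encoded form body, the standard library's URI.encode_www_form shows what such a payload looks like. This only illustrates the encoding and is not mechanize's internal code path; the 'q' parameter is made up for the example.

```ruby
require 'uri'

# Both query styles describe the same key-value pairs.
from_hash  = URI.encode_www_form('foo' => 'bar', 'q' => 'hello world')
from_array = URI.encode_www_form([%w[foo bar], ['q', 'hello world']])

puts from_hash
puts from_array
```

An array of arrays is the form to reach for when a key repeats, since a hash cannot hold duplicate keys.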

PUT to uri with entity, setting headers:

  put('http://example/', 'new content', {'Content-Type' => 'text/plain'})

Makes an HTTP request to url using HTTP method verb. entity is used as the request body, if allowed.

Submits form with an optional button.

Without a button:

  page = agent.get('http://example.com')
  agent.submit(page.forms.first)

With a button:

  agent.submit(page.forms.first, page.forms.first.buttons.first)

Runs given block, then resets the page history as it was before. self is given as a parameter to the block. Returns the value of the block.
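
The save-and-restore behaviour can be sketched on a plain Array. The helper name with_restored_history is hypothetical, and the Array merely stands in for the agent's history.

```ruby
# Hypothetical helper mirroring the described behaviour on a plain Array.
def with_restored_history(history)
  saved = history.dup
  yield history
ensure
  history.replace(saved) # runs even if the block raises
end

history = ['http://example.com/']
result = with_restored_history(history) do |h|
  h << 'http://example.com/login' # visible only inside the block
  h.size
end
```

The block's value is returned to the caller, while the history ends up exactly as it was before the call.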

Settings

Settings that adjust how mechanize makes HTTP requests including timeouts, keep-alives, compression, redirects and headers.

Attributes

default_encoding  [RW]  A default encoding name used when parsing HTML. When set it is used after any other encoding. The default is nil.
force_default_encoding  [RW]  Overrides the encodings given by the HTTP server and the HTML page with the default_encoding when set to true.
html_parser  [RW]  Default HTML parser for all mechanize instances
  Mechanize.html_parser = Nokogiri::XML
html_parser  [RW]  The HTML parser to be used when parsing documents
keep_alive_time  [RW]  HTTP/1.0 keep-alive time. This is no longer supported by mechanize as it now uses net-http-persistent which only supports HTTP/1.1 persistent connections
log  [RW]  Default logger for all mechanize instances
  Mechanize.log = Logger.new $stderr
pluggable_parser  [R]  The pluggable parser maps a response Content-Type to a parser class. The registered Content-Type may be either a full content type like ‘image/png’ or a media type ‘text’. See Mechanize::PluggableParser for further details.

Example:

  agent.pluggable_parser['application/octet-stream'] = Mechanize::Download
proxy_addr  [R]  The HTTP proxy address
proxy_pass  [R]  The HTTP proxy password
proxy_port  [R]  The HTTP proxy port
proxy_user  [R]  The HTTP proxy username
watch_for_set  [RW]  The value of watch_for_set is passed to pluggable parsers for retrieved content
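
The content-type mapping described for pluggable_parser can be sketched with a plain Hash. The lookup order used here (exact content type, then bare media type, then a default) is an assumption made for illustration, and the symbols stand in for parser classes; see Mechanize::PluggableParser for the actual behaviour.

```ruby
# Hypothetical registry: keys are full content types or bare media types.
parsers = {
  'text/html' => :page,
  'text'      => :plain_text,
  'image/png' => :image
}

# Prefer an exact content-type match, then the media type, then a default.
parser_for = lambda do |content_type, default = :file|
  media_type = content_type.split('/').first
  parsers[content_type] || parsers[media_type] || default
end

parser_for.call('text/html')       # exact match
parser_for.call('text/csv')        # falls back to the 'text' media type
parser_for.call('application/zip') # nothing registered, so the default is used
```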

Public Instance methods

Adds credentials user, pass for uri. If realm is set the credentials are used only for that realm. If realm is not set the credentials become the default for any realm on that URI.

domain and realm are exclusive as NTLM does not follow RFC 2617. If domain is given it is only used for NTLM authentication.

NOTE: These credentials will be used as a default for any challenge, which exposes your password to disclosure to malicious servers. Use of this method will warn. This method is deprecated and will be removed in mechanize 3.

Sets the user and password as the default credentials to be used for HTTP authentication for any server. The domain is used for NTLM authentication.

basic_auth(user, password, domain = nil)

Alias for auth

Are If-Modified-Since conditional requests enabled?

Disables If-Modified-Since conditional requests (enabled by default)

Replaces the cookie jar with cookie_jar

Returns a list of cookies stored in the cookie jar.

Follow HTML meta refresh and HTTP Refresh headers. If set to :anywhere, meta refresh tags outside of the head element will be followed.

Controls following of HTML meta refresh and HTTP Refresh headers in responses.

Follow an HTML meta refresh and HTTP Refresh headers that have no "url=" in the content attribute.

Defaults to false to prevent infinite refresh loops.

Alters the following of HTML meta refresh and HTTP Refresh headers that point to the same page.

follow_redirect?()

Alias for redirect_ok

Is gzip compression of responses enabled?

Disables HTTP/1.1 gzip compression (enabled by default)

Connections that have not been used in this many seconds will be reset.

Sets the idle timeout to idle_timeout. The default timeout is 5 seconds. If you experience "too many connection resets", reducing this value may help.

When set to true mechanize will ignore an EOF during chunked transfer encoding so long as at least one byte was received. Be careful when enabling this as it may cause data loss.

Net::HTTP does not inform mechanize of where in the chunked stream the EOF occurred. Usually it is after the last-chunk but before the terminating CRLF (invalid termination) but it may occur earlier. In the second case your response body may be incomplete.

When set to true mechanize will ignore an EOF during chunked transfer encoding. See ignore_bad_chunking for further details

Are HTTP/1.1 keep-alive connections enabled?

Disable HTTP/1.1 keep-alive connections if enable is set to false. If you are experiencing "too many connection resets" errors, setting this to false will eliminate them.

You should first investigate reducing idle_timeout.

The current logger. If no logger has been set Mechanize.log is used.

Sets the logger used by this instance of mechanize

Responses larger than this will be written to a Tempfile instead of stored in memory. The default is 100,000 bytes.

A value of nil disables creation of Tempfiles.

Sets the maximum size of a response body that will be stored in memory to bytes. A value of nil causes all response bodies to be stored in memory.

Note that for Mechanize::Download subclasses, the maximum buffer size multiplied by the number of pages stored in history (controlled by max_history) is an approximate upper limit on the amount of memory Mechanize will use. By default, Mechanize can use up to ~5MB to store response bodies for non-File and non-Page (HTML) responses.

See also the discussion under max_history=
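
The ~5MB figure follows from the two defaults quoted in this documentation, as this short calculation shows. The variable names are illustrative only.

```ruby
max_history     = 50       # default number of history entries
max_file_buffer = 100_000  # default in-memory limit per response, in bytes

upper_bound_bytes = max_history * max_file_buffer
upper_bound_mb    = upper_bound_bytes / 1_000_000.0

puts "#{upper_bound_bytes} bytes (~#{upper_bound_mb} MB)"
```

Raising either setting raises the bound proportionally.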

Length of time to wait until a connection is opened in seconds

Sets the connection open timeout to open_timeout

Length of time to wait for data from the server

Sets the timeout for each chunk of data read from the server to read_timeout. A single request may read many chunks of data.

Controls how mechanize deals with redirects. The following values are allowed:

:all, true    All 3xx redirects are followed (default)
:permanent    Only 301 Moved Permanently redirects are followed
false         No redirects are followed

Sets the mechanize redirect handling policy. See redirect_ok for allowed values
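
The allowed values can be read as a small decision function. The helper below is hypothetical and follows the list of values literally; mechanize's internals may treat individual 3xx status codes differently.

```ruby
# Hypothetical helper: should a response with this status be followed
# under the given redirect policy?
def follow_redirect_status?(policy, status)
  case policy
  when :all, true then (300..399).cover?(status)
  when :permanent then status == 301
  else false
  end
end

follow_redirect_status?(:all, 302)       # followed
follow_redirect_status?(:permanent, 302) # not followed, only 301 qualifies
follow_redirect_status?(false, 301)      # never followed
```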

Maximum number of redirections to follow

Sets the maximum number of redirections to follow to limit

A hash of custom request headers that will be sent on every request

Replaces the custom request headers that will be sent on every request with request_headers

Retry POST and other non-idempotent requests. See RFC 2616 9.1.2.

When setting retry_change_requests to true you are stating that, for all the URLs you access with mechanize, making POST and other non-idempotent requests is safe and will not cause data duplication or other harmful results.

If you are experiencing "too many connection resets" errors you should instead investigate reducing the idle_timeout or disabling keep_alive connections.
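
The distinction drawn by RFC 2616 9.1.2 can be sketched as a predicate. The helper name and its keyword argument are hypothetical, and this is not the retry logic mechanize actually runs. It only shows which verbs count as idempotent and how the setting widens the set.

```ruby
# Hypothetical helper: may a failed request with this verb be retried?
def retry_allowed?(verb, retry_change_requests: false)
  # Methods RFC 2616 section 9.1.2 describes as idempotent.
  idempotent = %w[GET HEAD PUT DELETE OPTIONS TRACE]
  idempotent.include?(verb.to_s.upcase) || retry_change_requests
end

retry_allowed?('GET')                               # safe to repeat
retry_allowed?('POST')                              # not retried by default
retry_allowed?(:post, retry_change_requests: true)  # caller accepts the risk
```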

Will /robots.txt files be obeyed?

When enabled mechanize will retrieve and obey robots.txt files

The handlers for HTTP and other URI protocols.

Replaces the URI scheme handler table with scheme_handlers

The identification string for the client initiating a web request

Sets the User-Agent used by mechanize to user_agent. See also user_agent_alias

Set the user agent for the Mechanize object based on the given name.

See also AGENT_ALIASES

SSL

SSL settings for mechanize. These must be set in the block given to Mechanize.new

Public Instance methods

Path to an OpenSSL server certificate file

Sets the certificate file used for SSL connections

An OpenSSL client certificate or the path to a certificate file.

Sets the OpenSSL client certificate cert to the given path or certificate instance

An OpenSSL certificate store for verifying server certificates. This defaults to the default certificate store for your system.

If your system does not ship with a default set of certificates you can retrieve a copy of the set from Mozilla here: curl.haxx.se/docs/caextract.html

(Note that this set does not have an HTTPS download option so you may wish to use the firefox-db2pem.sh script to extract the certificates from a local install to avoid man-in-the-middle attacks.)

After downloading or generating a cacert.pem from the above link you can create a certificate store from the pem file like this:

  cert_store = OpenSSL::X509::Store.new
  cert_store.add_file 'cacert.pem'

And have mechanize use it with:

  agent.cert_store = cert_store

Sets the OpenSSL certificate store to store.

See also cert_store

An OpenSSL private key or the path to a private key

Sets the OpenSSL client key to the given path or key instance. If a path is given, the path must contain an RSA key file.

OpenSSL client key password

Sets the client key password to pass

SSL version to use. Ruby 1.9 and newer only.

Sets the SSL version to use to version without client/server negotiation. Ruby 1.9 and newer only.

A callback for additional certificate verification. See OpenSSL::SSL::SSLContext#verify_callback

The callback can be used for debugging or to ignore errors by always returning true. Specifying nil uses the default method that was valid when the SSLContext was created

Sets the OpenSSL certificate verification callback
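
A debugging callback might look like the sketch below, which records failures but keeps OpenSSL's own verdict. The names verify_logger and failures are hypothetical, and the callback is attached to a bare OpenSSL::SSL::SSLContext here purely to show its shape.

```ruby
require 'openssl'

failures = []

# Called once per certificate in the chain with OpenSSL's preliminary result.
# Returning preverify_ok unchanged keeps verification strict; returning true
# unconditionally would accept any certificate.
verify_logger = lambda do |preverify_ok, store_ctx|
  failures << store_ctx.error_string unless preverify_ok
  preverify_ok
end

ctx = OpenSSL::SSL::SSLContext.new
ctx.verify_mode = OpenSSL::SSL::VERIFY_PEER
ctx.verify_callback = verify_logger
```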

The OpenSSL server certificate verification method. The default is OpenSSL::SSL::VERIFY_PEER and certificate verification uses the default system certificates. See also cert_store

Sets the OpenSSL server certificate verification method.